Problem Set Pollution Reductions from Renewable Energies: Social Benefits and Policy Implications

Author: Anna Sophie Barann

< ignore

library(RTutor)
library(yaml)
setwd("C:/Users/Sophie/ownCloud/Masterarbeit/PSPollutionReductions")
ps.name = "Pollution Reductions from Renewable Energies"; sol.file = paste0(ps.name,"_sol.Rmd")
libs = c("dplyr", "ggplot2", "lubridate", "reshape2", "stargazer", "lfe", "sandwich", "lemon", "microbenchmark", "GenKern", "gridExtra") # character vector of all packages you load in the problem set
name.rmd.chunks(sol.file) # set auto chunk names in this file
create.ps(sol.file=sol.file, ps.name=ps.name, user.name=NULL,
libs=libs, extra.code.file="functions.R", var.txt.file="variable_description.txt", addons="quiz")
# When you want to solve in the browser
# show.ps(ps.name,launch.browser=TRUE, load.sav=FALSE,
# sample.solution=FALSE, is.solved=FALSE)

>

Exercise Overview

Welcome to the problem set "Pollution Reductions from Renewable Energies: Social Benefits and Policy Implications"!

This problem set was developed as part of my Master's thesis at Ulm University. You will study the impact of renewable energies in Texas on pollution from fossil fuel power plants using the programming language R. In doing so you will not only learn about renewable energies and the energy sector, but also about econometrics and programming in R.
The problem set was written using the R package RTutor. It is interactive, which means that you will be asked to enter code yourself. Your code is then checked for correctness and completeness. If it is correct, you will see the results of your computations, which can then be analysed and interpreted. But don't worry, you will learn everything step by step.

The problem set is based on the paper "Valuing the Wind: Renewable Energy Policies and Air Pollution Avoided" by Kevin Novan (Novan (2015a)). You can access the paper as well as Novan's original dataset and code at Novan (2015b). We will reproduce Novan's results, giving you the opportunity to experience first-hand how the analysis for an econometric paper is done and allowing you to better understand the results and their implications. This will also be useful when reading other econometric papers or doing your own research in the future.
At the end of the problem set we will be able to discuss the implications of our results for policies trying to promote renewable energies.

Note: Since I will refer to Novan (2015a) quite often throughout the problem set, I will usually omit the formal citation. Instead, I will refer to it just as "Novan's paper" and to the author as "Novan" without stating the year of publication. Just know that in these cases I am referring to this paper.

Content

The problem set consists of the following exercises:

$\qquad$ Overview

$\qquad$ 1 Introduction
$\qquad \qquad$ 1.1 Background
$\qquad \qquad$ 1.2 Exploring Wind Energy
$\qquad \qquad$ 1.3 The Dataset

$\qquad$ 2 Building the Model
$\qquad \qquad$ 2.1 A Simple Model for Estimating Average Emissions Offset by Wind Production
$\qquad \qquad$ 2.2 The Classical Linear Regression Model
$\qquad \qquad$ 2.3 Extending the Model
$\qquad \qquad$ 2.4 Seasonality
$\qquad \qquad$ 2.5 Autocorrelation and Newey-West Standard Errors
$\qquad \qquad$ 2.6 Average Emissions Offset by Wind Production

$\qquad$ 3 Marginal Emissions Avoided
$\qquad \qquad$ 3.1 The Model
$\qquad \qquad$ 3.2 Heterogeneity and its Driving Factors

$\qquad$ Excursus: Loops vs. Vectorisation

$\qquad$ 4 Average Emissions Avoided by Wind Capacity Additions

$\qquad$ 5 Social Benefits from Wind Energy

$\qquad$ 6 Solar Power
$\qquad \qquad$ 6.1 Characteristics of Solar Energy
$\qquad \qquad$ 6.2 Social Benefits from Solar Energy

$\qquad$ 7 Implications for Renewable Energy Policies

$\qquad$ 8 Summary

$\qquad$ Bibliography

How to solve the interactive problem set

This problem set consists of several exercises, which build upon each other, but can be worked on individually if you are only interested in certain parts. At various points throughout the problem set you will find a note that you may skip the next part, allowing you to adapt the problem set to your prior knowledge in R and econometrics. You can select an exercise by clicking on the respective tab at the top of the page.

The problem set contains code chunks in which you are asked to enter R code. To do so, press edit above the chunk, enter your solution and then press check. The code chunk will be run and checked, and if your answer is correct you can proceed. If it is not correct, you can change your code and try again. You can get help by pressing hint. If you cannot solve the task or if you are only interested in the results, you can click on solution, which will automatically reveal the sample solution. By clicking data you can access the Data Explorer, where you can look at the dataset and check the variable descriptions. The button run chunk only runs the code entered in the code chunk without checking whether it fits the sample solution.
Usually you have to solve each code chunk correctly before you can proceed to the next one. A correct code chunk will sometimes produce a table or a plot which can then be interpreted.
By successfully completing tasks throughout the problem set you will earn awards. You can check your progress in the tab to the right of the Data Explorer in your browser.

You will also come across quizzes in this problem set. These do not have to be solved in order to continue with the exercise, but they test what you learned up to that point and sometimes provide additional information.
In order to cater to different levels of understanding of both coding in R and econometrics, there are info boxes. They are not crucial to solving or understanding the exercises, but give more detailed descriptions of R functions and packages or additional background information for those interested.

Just like any scientific publication, this problem set references other papers, textbooks or online resources. All references are listed in the tab Bibliography. Here you also find information on all R packages we will use throughout the problem set and a link to where you can download their manuals.

Let's get started!

Click on Go to next exercise... or choose the tab for exercise 1.1 above to start with the problem set. Good luck and have fun!

Exercise 1.1 -- Introduction: Background

In his paper "Valuing the Wind: Renewable Energy Policies and Air Pollution Avoided", Novan uses hourly variations in energy production of wind turbines in Texas to determine how wind power influences pollution from fossil fuel power plants. His aim is to draw conclusions from this on how renewable energy policies should be designed in order for them to offer optimal social benefits. Before diving into the analysis and reproducing Novan's findings step by step, let's discuss some of the basics we need to understand the rest of the problem set.

The Electrical Grid
Electricity is produced in power plants, which come in various forms: there are, for example, power plants fuelled by fossil fuels (like coal, gas or oil), nuclear power plants and power plants producing renewable energy from the sun or wind. The produced electricity is transported to the consumers via transmission lines. A power system broadly consists of power plants, a transmission network and consumers. It is referred to as a power grid (Zhu (2015, chapter 1.1)). Novan points out that for a grid to remain stable, the energy supplied has to equal the demand for energy in the grid at any time. This equilibrium of supply and demand is the basis for the analysis in this problem set. We rely on the fact that, when wind turbines generate any quantity of electricity, other power stations have to reduce their production by the same quantity in the short run in order to meet the consumers' demand. If the power plants which reduce their output are fossil fuel powered, fewer pollutants are emitted and wind power offsets emissions.

The Merit Order and Electricity Pricing
The displacement of non-renewable generation through wind energy can be illustrated by the merit order. It models the order in which power plants generate electricity to meet the current demand (Schellong (2016, chapter 9.4.1)). The units are sorted by increasing marginal costs, which are then plotted against the cumulative load. The idea is that the power plants with the lowest marginal costs produce first, followed by units with successively higher marginal costs until the current demand is met.
The two figures below show the schematic merit order for the Texan power plants. In the upper figure the energy mix does not contain renewable energies. The basic level of demand is covered by the generators with the lowest marginal costs. These are called base load power plants and most of the time they will produce at their maximum capacity (Hau (2014)). According to the merit order below, these are nuclear power plants in Texas. With increasing demand, additional power plants with higher variable costs begin to produce. The ones with the highest variable costs will only produce at peak levels of demand and are thus called peak load power plants. It is necessary that these plants can be ramped up and down very quickly to meet the current demand (Hau (2014)). According to the figure below, the peak load in Texas is mostly covered by simple gas turbines.
The last generating unit also sets the exchange price for electricity, which is determined by the point at which the curves for demand and marginal costs intersect (Schellong (2016, chapter 9.4.1)).
The marginal costs are mainly driven by fuel costs (Schellong (2016, chapter 9.4.1)). Therefore wind turbines and other renewable technologies have very low marginal costs and, in general, produce whenever the weather conditions allow it. The second figure below shows the effect of renewable energy on the production of conventional power plants. The whole graph is shifted to the right, which means that for a given level of demand, the conventional power plants with the highest marginal costs stop or reduce their production in response to renewable energy until production and demand are equal again. In the example below, renewable energies replaced generation from coal and combined cycle (CC) gas units. Here emissions were likely reduced. The shift to the right also means that the wholesale price for electricity falls, which is referred to as the merit order effect of renewable energies (Schellong (2016, chapter 9.4.1)). As Novan points out, this generally does not affect short-term demand because retail prices usually change over a longer time horizon.
It is important to note that the merit order can only describe short-term effects of renewable energy. An increasing share of renewable technologies in power production might change the composition of installed power plants in a market in the long run (Schellong (2016, chapter 9.4.1)).
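To make the merit order logic concrete, here is a minimal R sketch with entirely made-up marginal costs and capacities (the plant names and numbers below are hypothetical and only illustrate the sorting and price-setting mechanism, not actual ERCOT data):

# Hypothetical generating units with illustrative marginal costs [USD/MWh]
# and capacities [MW]:
plants = data.frame(unit = c("Nuclear", "Coal", "CC Gas", "Gas Turbine"),
                    marginal_cost = c(10, 25, 40, 70),
                    capacity = c(5000, 20000, 25000, 10000))
# Sort the units by increasing marginal cost and compute the cumulative
# capacity they can serve:
plants = plants[order(plants$marginal_cost), ]
plants$cum_capacity = cumsum(plants$capacity)
# The price-setting unit is the first one whose cumulative capacity covers
# the current demand:
demand = 42000
plants[which(plants$cum_capacity >= demand)[1], ]

Adding renewable capacity with marginal costs near zero at the front of this ordering shifts every conventional unit to the right, so a cheaper unit becomes the price setter for the same level of demand.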

Merit Order Conventional Only
Merit Order With Renewables
(Source: Own representation based on Rhodes et al. (2017) and Schellong (2016, chapter 9.4.1))

Pollution from Power Plants
Power plants emit a range of different pollutants including greenhouse gases (like CO$_2$, CH$_4$ and N$_2$O), SO$_2$, NO$_x$ or small particles like ash or dust (Dincer & Zamfirescu (2014, chapter 2.3)). Due to data availability, Novan focuses on three of those pollutants in his paper: carbon dioxide (CO$_2$), nitrogen oxides (NO and NO$_2$, together referred to as NO$_x$) and sulphur dioxide (SO$_2$).
CO$_2$ is a so-called greenhouse gas. These gases are responsible for the greenhouse effect and therefore contribute to global warming. According to Dincer & Zamfirescu (2014, chapter 2.3), greenhouse gases collect in the earth's atmosphere and absorb some of the surface's infrared radiation. They are a natural part of the earth's atmosphere, but man-made emissions increase the amount of greenhouse gases and upset the balance between the greenhouse effect and the cooling albedo effect. This increases the surface temperature of the earth.
NO$_x$ and SO$_2$ have a negative impact on the environment and the health of humans and animals. They can cause respiratory problems and lead to acidification of both soil and water (Dincer & Zamfirescu (2014, chapter 2.3)).

The Texan Grid
In his paper, Novan explores the effects of wind energy produced from wind turbines connected to the ERCOT Interconnection in Texas, which covers about $75\%$ of Texas and provides about $90\%$ of the Texan electricity (ERCOT (2017b)). It is managed by the Electric Reliability Council of Texas (ERCOT). The map below shows the extent of the power grids in Texas and the US. We will see why the ERCOT grid is especially well suited for our analysis in the next exercise.

Power Grids in the US
(Source: ERCOT (2017c))

This exercise laid the groundwork for understanding the problem set. In the next part of the introduction, we will explore the characteristics of wind energy.

Exercise 1.2 -- Introduction: Exploring Wind Energy

Most of the estimations in the following exercises will be done for wind power. Before starting to work with the data and replicate Novan's analysis, let us take a look at the variation in output from wind turbines. If you have no experience with R it is advisable to first read through the info box below which explains the basics of programming in R.

< info "Basics of Programming in R"

R is a free programming language and environment used for data analysis, statistics and graphics. It is popular because of its adaptability to new developments and the users' individual needs. For more information on R and its contributors or for downloads and more, see https://www.r-project.org/ and https://cran.r-project.org/.
A standard R installation comes with the R base-packages. They include several useful packages which contain a great number of frequently used functions. Beyond that there are numerous packages for various different applications which can be downloaded for free as well and extend your installation of R.
In R, a function is called with the function name followed by brackets. In these, one usually specifies the objects on which the function is applied as well as additional arguments. Venables et al. (2017) gives a good overview of the base R functions for key topics.
Some basic arithmetic commands in R are shown below for those of you who have no experience whatsoever in programming.

# Assigning a value to a variable
x = 5
# Addition
5 + 3
# Subtraction
10 - 7
# Multiplication
2 * 8
# Division
12 / 6
# Raising to the power of 2
4 ^ 2
# Extracting a root
sqrt(9)

Any other functions we need are introduced during the problem set. If you are interested in thoroughly learning the basics of R, you will find a good introduction in Davies (2016).

>

The data we need is saved in the file data_wind.rds. All datasets for this problem set are in the .rds-format. These files can be opened in R with the function readRDS(). The name of the dataset (including the quotation marks) has to be specified in the brackets. By assigning a name to the loaded dataset, we save it as a data frame which allows us to easily access the data. If you are not familiar with datasets and how they are stored in R, take a look at the next info box. Below you see the first code chunk. Here, enter the code to load the dataset and save it as data_wind. The basic structure of the command is already there, just uncomment the line (i.e. remove the # at the beginning) and replace the .... Don't forget to click check after entering your answer and remember that you can get help by clicking on hint.

#< task
# Read in the dataset 'data_wind.rds' and save it as data_wind.

# The command should have the following structure:  
# ... = readRDS("...")
#>

data_wind = readRDS("data_wind.rds")

< award "R-prentice"

Congratulations, you solved the first code chunk!

>

< info "Introduction to Datasets in R"

In R, datasets can be conveniently stored in a data frame. A data frame is a rectangular object whose columns all have the same length but can contain different types of data, like numbers, character strings or logical values. You can think of data frames as matrices whose columns can contain data of different types (they do not all have to be numeric values). If you want to know more about the characteristics of a data frame, see Venables et al. (2017, chapter 6.3).
Each column represents one variable in our dataset while each row is one observation. We are using time series data in this problem set, therefore each observation corresponds to a certain point in time and the observations are ordered according to time.
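As a small, self-contained illustration (the values are made up), a data frame can be created and accessed like this:

# A data frame with columns of different types:
df = data.frame(day = c(1, 2, 3),                 # numeric
                weekday = c("Sat", "Sun", "Mon"), # character strings
                windy = c(TRUE, FALSE, TRUE))     # logical values
# Access a single column with the $ operator:
df$weekday
# Access the element in row 2, column 1:
df[2, 1]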

>

Let's take a look at the dataset and see what is stored in it. A useful function when starting to work with a new dataset in R is sample_n() from the package dplyr. This function shows $n$ random rows from a table. In the brackets you need to specify our dataset. The argument size = ... allows you to choose how many rows are displayed. Before we can use a package for the first time in an exercise, we have to load it. A package is loaded using the function library() with the name of the package in brackets. If you want to learn more about dplyr, check the info box below. Load the package dplyr and display ten random lines from data_wind. Again, replace the ... in the code chunk.

#< task
# First, load the package 'dplyr'.

# Then display ten random rows from 'data_wind'.

# Your answer should look like this:
# library(...)
# sample_n(..., size=...)
#>

library(dplyr)
sample_n(data_wind, size=10)

< award "Sampler"

Well done, you displayed random rows from a data frame using sample_n().

>

< info "Manipulating Data with 'dplyr': Introduction and 'sample_n()'"

The package dplyr provides very useful functions to manipulate data frames. You can filter rows fulfilling certain criteria, store results from computations on the data in new columns, select specific columns for a new data frame or perform computations on subsets of the data, among many other things.
Here we use the function sample_n() to display several randomly chosen rows from our dataset. We will use other functions from dplyr in the course of this problem set.

>

The dataset contains hourly data on power production from wind turbines in Texas from the year 2011. The data is part of Novan's original dataset, which is available at Novan (2015b). The column datetime contains each observation's date (in year-month-day format) followed by the time. (Please note that the time does not denote the specific point in time, but rather the whole hour following that given point in time. So, 2011-01-02 00:00:00 specifies the first hour of January 2, 2011. To avoid confusion, the column hour gives the number of the hour.) The second and third column, month and day, specify the month and the day in the month, respectively. Finally, column wind contains the total hourly generation from ERCOT wind turbines (in $\text{MWh}$) and is provided by ERCOT itself (cf. section II in Novan's paper).

First, let's take a look at how the hourly output from wind turbines changes over one month, taking October 2011 as an example. The required code is already provided in the next code chunk. The first command creates a new data frame called data_wind_oct which is a subset of data_wind containing observations from October 2011 only. It uses the function filter() from dplyr, which filters all rows with the specified characteristics. We do not have to load dplyr again in this chunk because we already used it in this exercise. The second line of code loads the package ggplot2 which we will use for plotting. If you are interested, you can read more on ggplot2 in the info box below. The last command creates a plot of the hourly wind output against time. Just click on check to run the code.

#< task_notest
# Create a subset of the data only containing the data for October.
data_wind_oct = filter(data_wind, month==10)

# Load the package ggplot2.
library(ggplot2)

# Plot 'wind' against 'datetime'.
# The first line initialises the plot and defines the dataset and the data
# for the x- and y-axis...
ggplot(data_wind_oct, aes(x=datetime, y=wind)) +

  # ...then a layer connecting all observations with a line is added...
  geom_line() +

  # ... and finally the axes labels, the plot title
  # and the scale of the x-axis are set.
  xlab("Hour") +
  ylab("Hourly ERCOT Wind Generation [MWh]") +
  ggtitle("Variation of Hourly Output From Wind Turbines in Texas") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_datetime(date_breaks = "2 days", date_labels = "%b %d")
#>

< info "Plotting with 'ggplot2'"

The package ggplot2 is an elegant and flexible system for creating plots in R. It allows the user to build plots by successively adding layers to the same object. Though the syntax might take a little getting used to, ggplot2 proves to be very powerful once you learn how to use it.
A new plot is initialised by the function ggplot(). Here you can specify the data frame and the plot aesthetics which are to be the same in all layers. Aesthetics describe how the data will be translated into visual characteristics and are set with the function aes(). In the code chunk above we define data_wind_oct as the input data frame and, in aes(), specify that column datetime is on the x-axis and wind on the y-axis.
Each new layer is added with a +. In this case, we want to create a plot connecting all observations by a line which we add by geom_line(). Since we already specified everything necessary in ggplot, we do not have to define anything in the brackets. The next three layers in the code chunk above set the axes labels and title using ggtitle(), xlab() and ylab(). The last two layers centre the title and modify the scale on the x-axis.

>

As you can see, the output from wind turbines varies considerably over time, even from one hour to the next. In some periods over the month, very little wind energy was produced in Texas, while in peak hours well over $7000\,\text{MWh}$ were produced. This matches our everyday experience that wind, in terms of both wind speed and direction, can fluctuate substantially even over short periods of time. For the electricity generation from wind turbines, wind speed is especially important (Wan (2005)). The graph also shows that wind output tends to peak during the night.

Now let's take a look at the variations in average wind production over all hours for each day in 2011. The following code chunk first calculates the average wind energy production for each day in 2011 and saves it in the separate column wind_avg in the new data frame data_wind_daily_1. To do so, we use the pipe operator %>% and the functions group_by(), ungroup() and summarise() from the package dplyr (see the info box below for more information). Then ten lines from data_wind_daily_1 are shown. Just press check.

#< task
# This command computes the daily average output over all hours for
# each day in 2011 and saves it as 'data_wind_daily_1'.

# First, the dataset is specified...
data_wind_daily_1 = data_wind %>%
  # ...then the observations are grouped by 'month' and 'day'...
  group_by(month, day) %>%
  # ...now the mean of 'wind' is computed and saved in a new column
  # called 'wind_avg'...
  summarise(wind_avg = mean(wind)) %>%
  # ...finally, the grouping is removed.
  ungroup()

# This line shows ten random lines from the new data frame data_wind_daily_1.
sample_n(data_wind_daily_1, size=10)
#>

< info "Manipulating Data with 'dplyr': '%>%', 'group_by()', 'summarise()' and 'mutate()'"

%>%: The pipe operator enables executing several functions, one after the other, in an elegant way. The result from one function is used as the input for the next function without having to define new data frames or nest several functions. The data frame which is to be used is only defined once in the first line and does not have to be specified again in the successive commands. The commands are connected with the pipe operator.

group_by(): Groups a table according to the specified variables. Any operation on the data thereafter will be performed on each group. Here we group the data frame by month and day in order to be able to average wind production over all hours in each day in 2011. ungroup() dissolves the grouping.

summarise(): Summarises the data according to the specification in brackets. Here we take the mean of wind. Since we grouped our data before, summarise() will be performed on each group individually. The result is saved in a new column, which we call wind_avg. The final data frame is saved as data_wind_daily_1.

mutate(): Adds one or more new variables to an existing data frame. Other variables in the same data frame can be used to compute the new variables. We use it below to add a column date to data_wind_daily_1 which is computed using the existing columns month and day.
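To see what the pipe buys us, compare a nested call with its piped equivalent; both compute the same daily averages from data_wind:

# Nested call: has to be read from the inside out.
daily_nested = ungroup(summarise(group_by(data_wind, month, day),
                                 wind_avg = mean(wind)))
# Piped call: the same computation, read from top to bottom.
daily_piped = data_wind %>%
  group_by(month, day) %>%
  summarise(wind_avg = mean(wind)) %>%
  ungroup()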

>

As you can see, after summarising the data the new data frame only has columns for the two variables we grouped by, month and day, as well as the new column wind_avg. And for each day we now only have one row.
The next chunk will produce a graph visualising the results. The necessary commands are already there. Since summarise() dropped all columns except month, day and wind_avg and we want to depict the temporal progress of wind production, we need to create a new column date before we can plot. This is done using mutate() from dplyr (see info box above) and make_date() from the package lubridate. Further information on the latter package is found in the info box below. Just press check.

#< task_notest
# Load the package 'lubridate'.
library(lubridate)

# Add a column 'date' to 'data_wind_daily_1' and store it in
# 'data_wind_daily_2'.
data_wind_daily_2 = mutate(data_wind_daily_1,
                         date=make_date(year=2011, month=month, day=day))

# Create the plot.
ggplot(data_wind_daily_2, aes(x=date, y=wind_avg)) +
  geom_line() +
  xlab("Day") +
  ylab("Daily Average ERCOT Wind Generation [MWh]") +
  ggtitle("Variation of Daily Average Output From Wind Turbines in Texas") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_date(date_breaks = "month", date_labels = "%b %d", minor_breaks = NULL)
#>

< info "Managing Time with 'lubridate'"

The package lubridate contains many useful functions for working with dates, times and time spans. It allows you, among others, to extract information from a timestamp, for example the month, day of the week or whether a given year was a leap year, or convert an object into a date-time or timespan. Here we use make_date() in order to create a date from month and day and the additional information that the year is 2011.
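A few short examples (the date below is arbitrary):

# Create a date from its components...
d = make_date(year = 2011, month = 10, day = 15)
# ...and extract information from it again:
month(d)               # the month: 10
wday(d, label = TRUE)  # the day of the week
leap_year(2011)        # FALSE, 2011 was not a leap year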

>

We see substantial variations in ERCOT wind output over the whole year as well. Wind production tends to be lowest during summer.

The two graphs show that output from wind turbines varies considerably both long- and short-term. This is the reason why the following analysis can be done in the first place. Novan uses the hourly fluctuations in output from Texan wind turbines to estimate the short-run marginal impact of wind energy on pollution from conventional power plants. Ultimately, he wants to find out what effect one additional $\text{MW}$ of installed wind capacity has on pollution at different capacity levels. To get reliable results there has to be substantial variation in the hourly wind production. The next info box explains why Novan chose Texas for his analysis.

< info "Why Texas?"

You might wonder why Novan has chosen data from Texas for his analysis. What advantages does the Texan grid have compared to others? As we already learned in exercise 1.1, most consumers in Texas are connected to the ERCOT Interconnection, a grid that extends over most of Texas and only Texas. This is unique in the USA, where most grids extend over a large area, up to a third of the whole country (see again the map in the last exercise). There is little connection between the ERCOT grid and the surrounding grids. According to ERCOT (2017a), there are $430\,\text{MW}$ of transmission capacity between the ERCOT grid and the Mexican grid and $820\,\text{MW}$ with the Southwest Power Pool (SPP) grid in Oklahoma. This isolation simplifies the problem because it means that in order to estimate the effects of wind energy in Texas (more precisely ERCOT wind energy) on pollution, we need only to consider fossil fuel power plants in these regions. However, since Novan does not have access to data on emissions from Mexican power plants, he only considers effects on SPP power plants, as we will see in exercise 2.3.
Another advantage is that the ERCOT grid has a large amount of installed wind capacity. In fact, Texas has had the most installed wind capacity in the US since it overtook California in 2006 (WINDExchange (2017)). A large installed capacity implies large variation in output, which is necessary in order to get reliable estimates. Novan also points out that there is very little hydroelectric capacity (we will see that below). This is important because hydropower is stored as potential energy. If part of the output from wind turbines is stored in pumped storage hydro power stations, conventional power plants respond to the wind energy produced at a certain point in time with a delay. Large quantities of hydropower in the energy mix would therefore distort the results from the estimations.

>

Next, let's take a look at the composition of ERCOT electricity. Load the dataset data_mix.rds, store it in data_mix and show the data frame. You can do the latter by simply entering data_mix.

#< task
# Load 'data_mix.rds' and save it as 'data_mix'. Then show 'data_mix'.
#>

data_mix = readRDS("data_mix.rds")
data_mix

Stored in the data frame is, for each year and fuel type, the percentage share of average hourly net ERCOT production (measured in $\text{MWh}$) in total electricity generation. Fuel type other includes biomass, landfill gas, other fossil fuels apart from gas and coal, and solar. The data this summary is based on is taken from Novan's original dataset and was provided by ERCOT.
Let's visualise the ERCOT energy mix with a graph for each fuel type. To do so, we first have to reshape data_mix such that all shares are in one column and the fuel type is in a separate column. For that we use melt() from the package reshape2. The structure we need for this command is ... = melt(..., id.vars="...", variable.name="fueltype", value.name="share"). id.vars=... specifies the columns which are kept, in our case year. The arguments variable.name and value.name set the names for the column containing the information on the fuel type and shares, respectively. If you are interested in the package reshape2, check the info box below. In the code chunk below, first load the package. Then adapt the provided line of code to melt data_mix and store the new data frame as data_mix_melt. Finally, show the first few lines of the new data frame using the command head().

#< task
# Load 'reshape2' and melt 'data_mix' as described above.
# Store the result in 'data_mix_melt'.

# The structure for the command should look as follows:
# ... = melt(..., id.vars="...", variable.name="fueltype", value.name="share")

# Then show the first few lines of the new data frame.
#>

library(reshape2)
data_mix_melt = melt(data_mix, id.vars="year",
                     variable.name="fueltype", value.name="share")
head(data_mix_melt)

< award "Melter Lv.1"

Congratulations, you correctly adapted the provided code to melt a data frame!

>

< info "Reshaping Data with 'reshape2'"

The package reshape2 comprises functions that are useful for transforming data from wide to long format and vice versa. This is sometimes necessary because some packages require data to be in long format while for other applications the wide format is more convenient. The core functions in reshape2 are melt()-functions, which melt wide-format data into long-format data, and the cast()-functions which turn data from long format into wide format.
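As a small sketch with made-up shares, melting and casting are inverse operations:

# A small wide-format data frame (hypothetical values):
wide = data.frame(year = c(2010, 2011),
                  coal = c(0.40, 0.38),
                  wind = c(0.08, 0.09))
# melt() stacks the fuel columns into long format...
long = melt(wide, id.vars = "year",
            variable.name = "fueltype", value.name = "share")
# ...and dcast() turns the long format back into wide format:
dcast(long, year ~ fueltype, value.var = "share")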

>

We now have a data frame with only three columns, but with 30 rows. In other words, it was transformed from a wide into a long format. The fuel type columns have effectively been stacked underneath one another, and column fueltype now contains the information on the fuel type for each observation. With this we can create the plots. Just click check.

#< task_notest
# Create a plot for each fuel type.
ggplot(data_mix_melt, aes(year, share, color=fueltype)) +
  geom_line(size=1.5) +
  xlab("Year") +
  ylab("Average Share") +
  ggtitle("Average Hourly ERCOT Production by Fuel Type") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90)) +
  facet_wrap(~fueltype)
#>

The plots show that in all five years of our sample period, the majority of the electricity in Texas is produced from fossil fuels. Nuclear power has the next largest share. Then follows wind energy, whose share grows from about $3\%$ in 2007 to more than $9\%$ in 2011. Hydropower and others have only a very small share in the energy mix. So most of the energy in Texas is from power plants emitting pollutants, but wind energy has a substantial, growing share of the energy mix, ensuring large variations in output, which we can use for our investigation. We can also see that hydropower makes up only a very small part, which is important for our results not to be distorted (as explained in the info box "Why Texas?" above).

In this exercise you learned a lot about the characteristics of wind power. Next you will be introduced to the dataset we will use in the following analysis.

Exercise 1.3 -- Introduction: The Dataset

Now it is time to prepare for the analysis starting in the next exercise. We will take a look at the dataset that we will be using for the more theoretical part in the next few exercises and I will provide a bit more background information on the data. Load the dataset data_theory.rds and show ten random rows.

#< task
# Load the dataset and save it as 'data_theory'.

# Then show ten random lines from the new dataset.
#>

data_theory = readRDS("data_theory.rds")
sample_n(data_theory, 10)

This dataset contains hourly data on wind generation, electricity demand and CO$_2$ emissions from fossil fuel power plants for the period from January 1, 2007 to December 31, 2011 (see also the info box below). Columns datetime, hour and wind are the same as before. tx_co2 contains the aggregate hourly CO$_2$ emissions (in $\text{tons}$) from fossil fuel power plants connected to the ERCOT grid. The data in this column is provided by the Environmental Protection Agency (EPA) (cf. section II in Novan's paper). The next three columns, load, load2 and load3, contain the hourly electricity demand (in $\text{MWh}$) in the ERCOT market and its square and cube, respectively. The last three columns, spp_load, spp_load2 and spp_load3, contain the same for the neighbouring Southwest Power Pool (SPP) market in Oklahoma. This data is provided by the Federal Energy Regulatory Commission (FERC). Remember that, if you are unsure about any variable at some point in this problem set, you can check the variable description by clicking on data in the respective code chunk. The variable description will also be shown in a table when hovering over a column name with the cursor.

< info "A Remark Concerning the Dataset"

There are a few observations missing in our dataset. On March 11, 2007, March 9, 2008, March 8, 2009, March 14, 2010 and March 13, 2011, hour 3 is missing each time and in 2009, the observations for March 31 are missing altogether. Due to the large number of total observations, though, this does not pose a problem in our analysis.

>

In this exercise you got to know the dataset we will use for the analysis in the following exercises. The next exercise will introduce you to the first regression which seeks to estimate the average emissions offset by wind production. At the same time we will learn the basics needed to perform regressions and interpret the results.

If you are not interested in seeing how the model is set up and are already familiar with the classical linear regression model and ordinary least squares estimation, you may skip exercises 2.1-2.5 and continue with exercise 2.6.

Exercise 2.1 -- Building the Model: A Simple Model for Estimating Average Emissions Offset by Wind Production

Now that we are familiar with the dataset, we can start with our first regression. We want to estimate how the output from wind turbines influences the emission of CO$_2$, NO$_x$ and SO$_2$ from fossil fuel power plants. To estimate the relation, Novan is using a linear regression model, which means that wind output is assumed to influence pollutant emissions linearly. The formula for our first, basic model looks as follows: $$E_t = \beta_0 + \beta_1 \cdot W_t + \varepsilon_t \qquad (2.1-1)$$
The regression is done for each of the three pollutants individually. $E_t$ denotes the aggregate hourly emission of CO$_2$ (in $\text{tons}$), NO$_x$ (in $\text{lbs}$) or SO$_2$ (in $\text{lbs}$), respectively, at time $t$. $W_t$ is the aggregate hourly energy generated by wind turbines in the ERCOT grid (in $\text{MWh}$) in period $t$. In econometrics, it is always assumed that relationships such as the one above are not exact equations, but that they are influenced by a random disturbance, making them stochastic (Kennedy (2008, chapter 1.2)). That is why the equation for our regression model includes the error term $\varepsilon_t$. $\beta_0$ is a constant and describes the emission of pollutants without wind energy. $\beta_1$ is the parameter characterising the relation between pollution and wind output. It describes how, on average, one $\text{MWh}$ of ERCOT wind generation changes the emission of the pollutant. (2.1-1) is not yet the model Novan uses, but a simpler version. We will extend the model successively and discuss the changes in each step.

You may have noticed that the formula above is actually a set of formulas, one for each observation at time $t$ in our dataset. The goal is now to find those values for $\beta_0$ and $\beta_1$, which best match our set of observations for the explanatory variable $W$ and the dependent variable $E$ over all periods. There are different interpretations of what is considered a good or the best estimate for $\beta$ and also different methods to estimate $\beta$ depending on the characteristics of the data. Novan uses a method called Ordinary Least Squares (OLS) in his paper. We will take a closer look at how this method works and why it is justified to use it for our data in the following exercise.

Let's perform a regression in R ourselves. In order to learn the basics of linear regressions and OLS we will only focus on one of the pollutants, CO$_2$, for now. We should first think about what outcome we expect from the regression. This will help us interpret the results later on.


< quiz "Winds Coefficient I"

question: Which unit does the coefficient for W have? Recall that W is measured in MWh and E in tons.

sc:
- MWh
- tons
- tons/MWh*
- MWh/ton
success: Correct, the coefficient for W is measured in tons/MWh.
failure: Your answer is not correct. Remember that the units on the left-hand side and the right-hand side of an equation have to match up.

>


< quiz "Winds Coefficient II"

question: Do you expect a positive or a negative relation between wind production and emission of CO2? That is, do you expect the coefficient for W in the regression to have a positive or a negative sign? Or do you think it will be zero after all?

sc:
- I expect it to be positive.
- I expect it to be negative.*
- I expect it to be zero.
success: Correct, we can expect the coefficient to be negative. As was already explained in exercise 1.1, we expect production from emission-rich fossil fuel power plants to decrease in response to the supply of wind energy to the grid. Consequently, the emission of pollutants should be reduced.
failure: Your answer is not correct. It might help if you ask yourself whether the emission of pollutants increases or decreases when the output from wind turbines becomes larger. You can also consult equation (2.1-1) again.

>


Since this is a new exercise, we need to reload the file data_theory.rds. Click edit on the next code chunk and then enter your solution.

#< task
# Load the dataset 'data_theory.rds' and save it as 'data_theory'.
#>

data_theory = readRDS("data_theory.rds")

In order to perform the linear regression in R, we use the function lm(). If you are not familiar with the function, you can take a look at the info box below. Recall that the output from wind turbines is called wind in our dataset while the CO$_2$ emissions are denoted by tx_co2. Complete the provided command below to perform the regression of tx_co2 on wind. Store the result as reg_wind.

#< task
# Run the regression of 'tx_co2' on 'wind' and store the result in 'reg_wind'.

# Use the following structure:
# ... = lm(... ~ ..., data = ...)
#>

reg_wind = lm(tx_co2 ~ wind, data = data_theory)

< award "Regressor Lv.1"

Congratulations, you successfully performed your first regression!

>

< info "Linear Regressions with 'lm()'"

The function lm() estimates a linear regression model. The syntax of the function call looks like this:
lm(dependentvariable ~ explanatoryvariable1 + explanatoryvariable2 + ..., data = dataset)
To the left of the tilde you have to specify the dependent variable, in our case tx_co2. The explanatory variables are entered to the right of the tilde, separated by a +. We only have one explanatory variable here, wind. The constant does not have to be stated explicitly. Finally, one has to define the data frame that contains the data for the regression (note that it is optional to enter the data = before the data frame's name).
Apart from the estimates for the coefficients, lm() also computes a number of summary statistics and hypothesis tests. When the function is assigned to a variable, all of these are stored in that variable and can be retrieved at a later date.
An introduction to lm() with examples is given in Davies (2016, chapter 20.2.3).
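For example, with the result stored in reg_wind as in the chunk above, several base R functions retrieve different parts of the fitted model:

coef(reg_wind)            # the estimated coefficients
summary(reg_wind)         # coefficients, standard errors, t-values, R-squared
head(residuals(reg_wind)) # the first few residuals
head(fitted(reg_wind))    # the first few fitted values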

>

The next code chunk uses the function stargazer() from the identically named package to create a table that gives a nice overview of the regression results. It is similar to the tables in Novan's paper and other scientific publications. More on stargazer can be found in the info box below. Click check to run the code.

#< task_notest
# The following lines load the package 'stargazer' and produce a table
# with the regression results.
library(stargazer)

stargazer(reg_wind,
          type="html", style="aer",
          keep.stat=c("rsq","n"),
          covariate.labels="ERCOT Wind [MWh]",
          dep.var.labels="CO2 [tons]",
          star.cutoffs = c(0.05, 0.01),
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.",
          notes.append = FALSE)
#>

< info "Creating Regression Tables with 'stargazer'"

stargazer() creates LaTeX-, HTML- or ASCII code for tables with regression results. It is popular among researchers because it allows easy transfer of the results from R to other programmes and has a lot of options to adapt both layout and content of the table. Furthermore, there are templates which can be applied to make the tables look like those of major scientific journals.

>

For now we will only discuss certain aspects of the table above. The other parts will be explained over the course of the next exercises. First, let's see if you know what some of the values in the table mean.


< quiz "Results from Simple Regression"

parts:
- question: 1. Which parameter from equation (2.1-1) has the value -1.067?
  choices:
  - E
  - beta_0
  - beta_1*
  - W
  - epsilon
  multiple: FALSE
  success: Great, your answer is correct!
  failure: Your answer is not correct. Try again.
- question: 2. Which parameter from equation (2.1-1) is labelled with 'Constant' in the table?
  choices:
  - E
  - beta_0*
  - beta_1
  - W
  - epsilon
  multiple: FALSE
  success: Great, your answer is correct!
  failure: Your answer is not correct. Try again.

>


< award "Interpreter"

Congratulations, you correctly identified the values and thus took the first step to correctly interpreting the results from a regression!

>

The regression yields estimates for $\beta_0$ and $\beta_1$. We are mainly interested in $\beta_1$, which characterises the relation between pollution and wind output. Its estimate is negative which is what we expected. It means that the emission of CO$_2$ decreases as the output from wind turbines increases. To be exact, according to the regression results, one $\text{MWh}$ of output from ERCOT wind turbines reduces the emission of CO$_2$ from ERCOT fossil fuel power plants on average by $1.067\,\text{tons}$. The line labelled "Constant" contains the estimate for $\beta_0$ and describes the amount of CO$_2$ emitted when there is no electricity generation from wind turbines. It is of little importance for our analysis. "Observations" indicates the number of observations in our dataset.

With our simple regression above we produced an estimate for the parameter of interest, $\beta_1$. But how can we tell whether the model we used describes reality well? And how do we know that the methods we applied are actually suited for our data? To answer these questions, the next exercise will introduce you to the theoretical background for the linear regression model and OLS.

If you are already familiar with the assumptions of the classical linear regression model and ordinary least squares estimation, you may skip the following exercise and continue with exercise 2.3.

Exercise 2.2 -- Building the Model: The Classical Linear Regression Model

In the last exercise we saw what a linear regression is and how we can perform these in R. To be able to judge how well a regression model fits the data and to interpret the results, we need to know more about the theory behind linear regressions.

a) Assumptions for the Classical Linear Regression Model

According to Greene (2012, chapter 2) and Kennedy (2008, chapter 3.2), there are the following six assumptions for the classical linear regression (CLR) model:

1. Linearity: $$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} +...+ \beta_K x_{tK} + \varepsilon_t$$ The dependent (or explained) variable $y$ is a linear combination of the set of $K$ independent (or explanatory) variables $x_1, x_2, ..., x_K$ plus an error (or disturbance) term $\varepsilon$. We are working with time series data with $T$ observations, indexed by $t=1,2,...,T$. $\beta=(\beta_0, \beta_1,...,\beta_K)$ is a vector of true coefficients and is constant for all observations $t$.
In matrix notation, the equation above looks as follows: $$y = X \beta + \varepsilon$$ with $$X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{T1} & \dots & x_{TK} \end{pmatrix}$$

2. Exogeneity: $$E[\varepsilon_t|x_{s1},x_{s2},...,x_{sK}]=0$$
The expected value of the error term $\varepsilon_t$ conditional on the independent variables of all time periods $s$ is zero. This means that the disturbance for time period $t$ is not dependent on any independent variable at any time period $s$ (including $t$).
Furthermore, it is assumed that each disturbance $\varepsilon_t$ is independent of all other disturbances $\varepsilon_s$, $s \neq t$: $$E[\varepsilon_t | \varepsilon_1,...,\varepsilon_{t-1},\varepsilon_{t+1},...,\varepsilon_T]=0$$ Altogether we assume that the disturbances are random.
Greene (2012, chapter 2.3.3) states that the above also implies that for each $\varepsilon_t$ the unconditional mean $E[\varepsilon_t]=0$ and $Cov[\varepsilon_t,X]=0$. So the error term and the explanatory variables are uncorrelated. In that case we speak of exogenous variables.

3. Homoscedasticity and non-autocorrelation: $$Var[\varepsilon_t | X]=\sigma^2~~~\forall \thinspace t=1,2,...,T$$ The error terms have a constant variance $\sigma^2$. This characteristic is called homoscedasticity.
And also: $$Cov[\varepsilon_t, \varepsilon_s| X]=0~~~\forall \thinspace t \neq s$$ Observations of the disturbance are not correlated among themselves. This is referred to as non-autocorrelation.
Homoscedasticity and non-autocorrelation can be represented by the variance-covariance matrix of the error term $$Var[\varepsilon | X]=\sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \dots & 0 & \sigma^2 \end{pmatrix}$$ where $I$ is the identity matrix.

4. Full Rank: The explanatory variables $x_k$ are linearly independent and there are at least as many observations as explanatory variables.

5. Data Generation: $X$ can be fixed or random (or a mixture of both).

6. Normality: $$\varepsilon| X \sim N[0,\sigma^2 I]$$ The disturbances follow a normal distribution.
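One way to build intuition for these assumptions is a small simulation in which they hold by construction (all parameter values below are arbitrary illustrative choices): we draw homoscedastic, non-autocorrelated normal disturbances, generate the dependent variable linearly and check that a linear regression recovers the true coefficients.

set.seed(1)
# An explanatory variable and i.i.d. normal disturbances (homoscedastic
# and non-autocorrelated by construction):
x = runif(1000, min = 0, max = 10)
eps = rnorm(1000, mean = 0, sd = 2)
# True coefficients (arbitrary values) and the linear model:
beta0 = 3
beta1 = -1.5
y = beta0 + beta1 * x + eps
# The estimates should be close to the true coefficients:
coef(lm(y ~ x))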

b) Finding the parameters $\beta_k$

Now that we know the assumptions which the CLR model entails, we still need methods to actually estimate the parameters $\beta_k$. There are several criteria which are important for a "good" estimator. However, we will not go into much detail about most of them in this problem set. If you are interested in a short description of the criteria, take a look at the info box below.

< info "Criteria for Judging Estimators"

The significance of each criterion varies depending on the application and is, to some extent, up to each researcher's discretion.

(Kennedy (2008, chapter 2))

>

Kennedy (2008, chapter 3.3) and Greene (2012, chapter 4) agree that, in the case that the assumptions of the CLR model are met, the preferred estimator is the Ordinary Least Squares (OLS) estimator. From this estimator we get estimates for the $\beta_k$, which we will denote by $\hat{\beta}_k$. Similarly, we denote the fitted values of the dependent variable $y$ as $\hat{y}$. They are defined in the following way: $$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_{t1} + \hat{\beta}_2 x_{t2} + ... + \hat{\beta}_K x_{tK}$$ The OLS estimator finds the set of estimates $\hat{\beta}$ which minimises the sum of squared residuals (SSR). The residuals $\hat{\varepsilon}$ are the estimates of the disturbance $\varepsilon$ and defined as: $$\hat{\varepsilon}=y-\hat{y}=y-X\hat{\beta}$$ Therefore we have: $$\hat{\beta}=\text{arg min}_\beta \sum_{t=1}^T \hat{\varepsilon}_t^2=(X'X)^{-1}X'y$$
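We can check the closed-form solution numerically. The sketch below rebuilds $\hat{\beta}=(X'X)^{-1}X'y$ with matrix algebra for the simple regression from exercise 2.1 and compares it to the result of lm() (assuming data_theory is loaded):

# Design matrix with a column of ones for the constant:
X = cbind(1, data_theory$wind)
y = data_theory$tx_co2
# The OLS estimator in matrix form:
solve(t(X) %*% X) %*% t(X) %*% y
# This matches the coefficients reported by lm():
coef(lm(tx_co2 ~ wind, data = data_theory))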

If one or more assumptions for the CLR model are violated, the OLS estimator might no longer be the best option and other estimators have to be used that are better suited to estimate the model at hand. Although there are methods for testing the assumptions (or at least getting an idea whether they could be violated), finding out whether they are fulfilled for a given model is very difficult and a detailed consideration well beyond the scope of this problem set. For the interested reader, I suggest Kennedy (2008, chapters 6-12) for an overview of violations of the above assumptions and methods of testing for violations.

The model we used in exercise 2.1 is very basic and while we are looking to keep the model as simple as possible, it does, as you might already have guessed, have a few shortcomings. Don't worry if you are unsure whether some of the assumptions are met in our case. Finding a model representing the real-world relations sufficiently accurately and then assessing whether it is a CLR model is not an easy task. Also, there is not one correct model and there might be several suitable models. In exercises 2.3-2.5 we will highlight a few topics on the way to a model that is better suited to describe the relations we seek to explore. We will follow Novan's model while doing so.

Exercise 2.3 -- Building the Model: Extending the Model

a) An Extended Model

In exercise 2.1 we estimated a model which regressed the CO$_2$ emissions from fossil fuel power plants solely on the output from wind turbines in Texas. But might there be other variables we should consider in our model? To answer this question, let's take another look at the regression results. Run the next two code chunks to load the dataset, perform the regression according to (2.1-1) again and produce the table with results.

#< task
# Load the dataset 'data_theory.rds' and save it as 'data_theory'.

data_theory = readRDS("data_theory.rds")
#>
#< task_notest
# Run the regression of 'tx_co2' on 'wind' again and store the result
# in 'reg_wind'.

reg_wind = lm(tx_co2 ~ wind, data = data_theory)

# The following lines load the package 'stargazer' and produce a table
# with the regression results.

library(stargazer)
stargazer(reg_wind,
          type="html", style="aer",
          keep.stat=c("rsq","n"),
          covariate.labels="ERCOT Wind [MWh]",
          dep.var.labels="CO2 [tons]",
          star.cutoffs = c(0.05, 0.01),
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.",
          notes.append = FALSE)
#>

We'll now turn to the last row in the table above labelled "$R^2$". The coefficient of determination $R^2$ is a measure for the goodness of fit and describes how much of the total variation in the dependent variable $y$ can be explained by the fluctuations in the explanatory variables $X$ (Greene (2012, chapter 3.5)). It is defined as $$R^2=1-\frac{\text{SSR}}{\text{SST}}=1-\frac{\sum_{t=1}^T\hat{\varepsilon}_t^2}{\sum_{t=1}^T(y_t-\bar y)^2}$$ where

- $\text{SSR}$ is the sum of squared residuals,
- $\text{SST}$ is the total sum of squares and
- $\bar y$ is the mean of the dependent variable $y$.

(Notation based on Wooldridge (2013, chapter 2.3))

$R^2$ takes values between $0$ and $1$. If $R^2=0$, then the independent variables have no explanatory power. It is therefore desirable to have an estimator with a high $R^2$. In fact, since OLS finds the estimator which minimises the sum of squared residuals, it automatically maximises $R^2$.
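As a quick check of this definition, we can compute $R^2$ by hand from the residuals of reg_wind and compare it to the value lm() stores (assuming the chunks above have been run):

# Sum of squared residuals and total sum of squares:
ssr = sum(residuals(reg_wind)^2)
sst = sum((data_theory$tx_co2 - mean(data_theory$tx_co2))^2)
# R-squared from its definition...
1 - ssr / sst
# ...matches the value stored in the summary of the lm object:
summary(reg_wind)$r.squared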
We have $R^2=0.076$, suggesting that the fluctuations in wind generation alone can explain only a little of the variation in CO$_2$ emissions. While this is the relation we are interested in (and it is a nice and simple model), the value for $R^2$ indicates that we might need to control for other factors influencing emissions as well. One of these factors may be quite obvious.


< quiz "Proper Specification"

question: Can you think of which additional explanatory variable Novan includes in his model?

sc:
- Temperature
- Electricity production from fossil fuel power plants
- Electricity demand*
success: Great, your answer coincides with Novan's reasoning! But be aware that there can be various models for our application and other explanatory variables might be just as valid.
failure: This is not what the author chose to include in his model. There can be various models for our application though and your answer might be just as valid. Try again.

>


It seems intuitive that a higher demand for electricity requires higher production, in general and from conventional power plants in particular, and should therefore influence emissions. But including electricity demand has further advantages. As already mentioned in exercise 1.2, the output from wind turbines is mainly determined by the wind conditions, which are of course exogenous. But Novan also points out that weather conditions can influence both demand and wind generation. If demand is not included as an explanatory variable in the model, it is implicitly included in the disturbance term, which would then be correlated with $W$ and be a source of endogeneity. Furthermore, including demand controls for curtailments that arise when short-term electricity demand falls: if there is more excess electricity than the grid's transmission limits allow to be exported, wind turbines might need to curtail production (cf. Novan, pages 298-299).

Novan uses the hourly electricity usage in the ERCOT grid as a measure for electricity demand in Texas. The extended model then looks the following way: $$E_t = \beta_0 + \beta_1 \cdot W_t + \beta_2 \cdot D_t + \varepsilon_t \qquad (2.3-1)$$ where $D_t$ denotes ERCOT electricity demand at time $t$.

b) Consequences of Extending the Model and the Omitted Variable Bias

Adding a second explanatory variable not only gives us an estimate of this variable's coefficient $\beta_2$, but usually also changes the estimates for the other coefficients.


< quiz "Introducing demand I"

question: Do you expect emissions to increase or decrease with electricity demand, i.e. do you expect the coefficient for D to be positive or negative?

sc:
- positive*
- negative
success: That is correct, the coefficient for ERCOT demand can be expected to be positive. A higher demand for electricity increases production and should therefore in general also increase emissions from power plants.
failure: Your answer is not correct. Try again.

>


< quiz "Introducing demand II"

question: What do you expect to happen to the coefficient of W when we run a regression which also includes D as an explanatory variable? Compare it to the regression from exercise 2.1, where we regressed emissions on wind generation only.

sc:
- I expect it to become positive.
- I expect it to become zero.
- I expect it to still be negative, but become bigger (i.e. less negative).*
- I expect it to still be negative, but become smaller (i.e. more negative).
- I expect it to stay the same.

success: Great, your answer is correct. We will see that the coefficient for ERCOT wind generation is closer to zero when introducing demand into our model.
failure: Your answer is not correct. Try again.

>


< award "Modeller"

Congratulations, you identified another explanatory variable and formulated your expectations for the parameters of the new model!

>

Let's see if we get the results we expect. Use the function lm() to run the regression. Recall that the hourly ERCOT electricity demand is called load in our dataset. The structure of the required command is given in the next code chunk. If you cannot remember how to use lm(), you can go back to exercise 2.1. Your progress in this exercise will be saved.

#< task
# Run the regression of 'tx_co2' on 'wind' and 'load' (in that order)
# using 'data_theory'. Save the results as 'reg_wind_dem'. 

# Your command should look like this:
# ... = lm(... ~ ... + ..., data = ...)
#>

reg_wind_dem = lm(tx_co2 ~ wind + load, data = data_theory)

< award "Regressor Lv.2"

Congratulations, you just used lm() to perform your first regression with more than one explanatory variable!

>

The following code chunk creates a table which puts the results from both the simple and the extended regression side by side.

#< task_notest
# The following command produces a table with the results from both regressions.

stargazer(reg_wind, reg_wind_dem,
          type="html", style="aer",
          keep.stat=c("rsq","adj.rsq","n"),
          covariate.labels = c("ERCOT Wind [MWh]", "ERCOT Demand [MWh]"),
          dep.var.labels = "CO2 [tons]",
          model.numbers = FALSE,
          column.labels = c("Simple", "Extended"),
          star.cutoffs = c(0.05, 0.01),
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.",
          notes.append = FALSE)
#>

The regression yields a value of $\beta_2=0.653\,\text{tons/MWh}$ for the coefficient of our demand variable load. So according to the estimation, one additional $\text{MWh}$ of electricity consumption increases emissions of CO$_2$ by $0.653\,\text{tons}$. As expected, there is a positive relation between demand and emissions.
The coefficient for wind production did indeed increase compared to the regression without demand and is now $\beta_1=-0.619\,\text{tons/MWh}$. This means that the extended model estimates the influence of wind power on CO$_2$ emissions to be about $40\%$ lower than the simple model.

Now let's see how adding demand to our model has influenced $R^2$. The table shows that it increased tremendously and the extended model has $R^2=0.927$. This indicates that our reasoning for including demand was correct and it does indeed explain a lot of the fluctuations in CO$_2$ emissions. For more information on $R^2$ and why we have to be careful when drawing conclusions from it, see the info box below.

< info "R-Squared and Adjusted R-Squared"

According to Kennedy (2008, chapter 2.4), $R^2$ can be used as a criterion for the quality of an estimator when the form of the model and its explanatory variables are already determined. But it can also be used in the search for the specification of a model. In this application, however, $R^2$ has to be treated carefully because it will tempt the user to include too many explanatory variables in the model (Kennedy (2008, chapter 5.5)). This is due to the fact that adding another independent variable can never lead to a larger minimised SSR and therefore $R^2$ cannot become smaller. It will either stay the same or increase, the latter suggesting that the additional regressor should be included in the model.
It is therefore common to correct $R^2$ for the degrees of freedom, $T-K$, when comparing different models. This penalises the addition of another variable and can cause $R^2$ (which is now called adjusted $R^2$ or $\bar R^2$) to decrease if the additional variable explains only little of the variation in the dependent variable. This means that only those variables should be included in the model which cause $\bar R^2$ to increase.
The table above shows both $R^2$ and the adjusted $R^2$. As you can see, the two do not differ in the displayed digits. This is because the number of explanatory variables $K$ is small compared to the number of observations $T$ and adjusting for the degrees of freedom is therefore of little consequence. From now on the tables with regression results in this problem set will only show the adjusted $R^2$.
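As a small sketch (assuming reg_wind_dem from above is still in memory), the adjusted $R^2$ can be computed by hand from $R^2$, the number of observations $T$ and the number of explanatory variables $K$:

# Adjusted R-squared by hand; this should match
# summary(reg_wind_dem)$adj.r.squared.
R2 = summary(reg_wind_dem)$r.squared
T = nobs(reg_wind_dem)
K = 2  # number of explanatory variables (wind and load)
1 - (1 - R2) * (T - 1) / (T - K - 1)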

>

Introducing demand to our model changed the estimated impact wind energy has on CO$_2$ emissions: $\hat \beta_1$ in the simple regression was biased because we left out a variable that has an effect on the dependent variable $E$. This is referred to as the omitted variable bias. According to Wooldridge (2013, chapter 3.3), the omitted variable bias in the case of one omitted variable is related to the sign of $\beta_2$ and $corr(x_1, x_2)$ as follows: $$bias(\hat \beta_1^*) \sim sign(\beta_2) \cdot sign(corr(x_1, x_2)) \qquad (2.3-2)$$ where

- $\hat \beta_1^*$ is the estimate of $\beta_1$ from the regression without the second explanatory variable,
- $\beta_2$ is the coefficient of the omitted variable and
- $corr(x_1, x_2)$ is the correlation between the two explanatory variables.

Can we verify this for our situation? We know that demand is positively related to emissions, so $\beta_2$ is positive. But how are demand and wind generation correlated?


< quiz "Correlation of Wind and Demand"

question: How do you expect generation from wind turbines and demand in Texas to be correlated?

sc:
- I expect them to be uncorrelated.
- I expect them to be positively correlated.
- I expect them to be negatively correlated.*

success: Great, your answer is correct. We will indeed see that 'wind' and 'load' in our dataset are negatively correlated.
failure: Your answer is not correct. Try again.

>


Let's get an idea from a visual representation of average wind production and demand in the course of a day. In the next code chunk, create a new data frame from data_theory, which contains columns for the hourly mean of load and wind. Use the pipe operator %>% and the functions group_by() and summarise() from dplyr. The idea is to first specify that we are using data from data_theory, then to group it by hour and finally to compute the mean of load and wind. The mean is calculated with mean(). To help you solve this kind of task on your own for the first time, the structure of the command is given below. Also, remember that you have to load dplyr again. You saw how to use all of the functions in exercise 1.2, so you can go back again to take another look. And don't forget that if you need help, you can click on hint.

#< task
# First, load the package 'dplyr'.

# Then compute the hourly mean of 'wind' and 'load' from 'data_theory'
# and store the results in a new data frame called 'data_corr'.

# Adapt these lines for your answer (be aware of the capitals at the
# beginning of the new variable labels):
# ... = ... %>%
#   group_by(...) %>%
#   summarise(Load = ...,
#             Wind = ...)
#>

library(dplyr)
data_corr = data_theory %>%
  group_by(hour) %>%
  summarise(Load = mean(load),
            Wind = mean(wind))

Show some lines of the new data frame using head().

#< task
# Show the first lines of 'data_corr' using 'head()'.
#>

head(data_corr)

You can see that data_corr has one row for each hour and three columns containing the hour and the mean of wind and load. Before we can plot, we have to transform the data into long format using melt() from reshape2 again. This time we want to keep hour from data_corr and store the mean of load and wind in one column called mean while creating a new column type specifying whether the row describes a value for wind production or for demand. If you don't remember how to transform data from wide to long format, you can go back to exercise 1.2.

#< task
# Load the package 'reshape2' and melt 'data_corr' as described above.
# Use 'variable.name="type"' and 'value.name="mean"'.

# Store the long-format data frame in 'data_corr_melt'.

# Finally, show ten random lines from the new data frame.
#>

library(reshape2)
#< hint
display("Don't forget to load 'reshape2' first. As always, use 'library()' to do so.")
#>
data_corr_melt = melt(data_corr, id.vars="hour",
                      variable.name="type", value.name="mean")
#< hint
display("The command for melting 'data_corr' is analogous to the one we used in exercise 1.2. You have to assign the following values to the arguments: id.vars=\"hour\", variable.name=\"type\" and value.name=\"mean\". Remember that the new data frame should be called 'data_corr_melt'.")
#>
sample_n(data_corr_melt, size=10)
#< hint
display("Don't forget to use 'sample_n()' to show ten random lines of 'data_corr_melt' in the end.")
#>

< award "Melter Lv.2"

Congratulations, you melted a data frame on your own!

>

Now we can plot the data. We will produce two plots next to each other, one for mean wind production and one for mean demand. Enter part of the command for the plot yourself this time. The structure of the command including more specifications for the layout of the graph is given in the code chunk below. In the first part, ggplot(), you can specify the dataset as well as hour as the x-axis and mean as the y-axis. The argument color=type makes the two graphs have different colours. Then you add a layer geom_line(). The argument size=1.5 makes the lines thicker than the default width of 0.5. Don't forget to load ggplot2 first.

#< task
# Create two plots showing the mean hourly demand and wind generation.

# Load 'ggplot2' first and then adapt the following lines:

# ggplot(..., aes(x=..., y=..., color=type)) +
#   geom_line(size=1.5) +
#   xlab("Hour") +
#   ylab("Mean [MWh]") +
#   ggtitle("Comparison of Average Hourly Wind Output and Demand") +
#   theme(plot.title = element_text(hjust = 0.5)) +
#   facet_wrap(~type, scales="free_y")
#>

library(ggplot2)
ggplot(data_corr_melt, aes(x=hour, y=mean, color=type)) +
  geom_line(size=1.5) +
  xlab("Hour") +
  ylab("Mean [MWh]") +
  ggtitle("Comparison of Average Hourly Wind Output and Demand") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~type, scales="free_y")

< award "Plotter"

Well done, you plotted your first ggplot-graph!

>

The average hourly wind production seems to be higher in times of lower average hourly demand and vice versa (but be aware of the different scales on the y-axes). This suggests a negative correlation between wind production and demand. This can be checked by calculating the correlation of wind and load. For that, we use the function cor() with the two variables specified in brackets. You can access a column in a data frame in R by using the notation dataframe$variable.

#< task
# Calculate the correlation between 'wind' and 'load'
# from the dataset 'data_theory'.
#>

cor(data_theory$wind, data_theory$load)

The computation yields a correlation of $-0.12$, so hourly demand and wind production in Texas are in fact negatively correlated in our dataset. See the info box below for an additional remark on the correlation.

< info "A Note on the Correlation between Wind Generation and Demand"

Looking at the graphs we plotted above, one might expect a larger correlation in absolute value. And indeed, when computing the correlation for the mean hourly demand and wind generation, it is larger in absolute value:

# Load 'dplyr', then load the dataset and create 'data_corr' as above.
library(dplyr)
data_theory = readRDS("data_theory.rds")

data_corr = data_theory %>%
  group_by(hour) %>%
  summarise(load = mean(load),
            wind = mean(wind))

# Compute the correlation of mean hourly demand and wind generation.
cor(data_corr$wind, data_corr$load)

The relevant correlation, however, is the one for the hourly values.

>

We know now that demand positively influences emissions and that demand and wind production have a negative correlation. According to equation (2.3-2), this means that $\hat \beta_1^*$ should be negatively biased (i.e. the estimate should be smaller for the model without demand). And this is exactly what we found when comparing $\hat \beta_1$ in our two regressions. It had a smaller value when regressing CO$_2$ emissions only on wind production compared to when regressing it on both ERCOT wind production and demand.
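If you like, you can also verify the sign of the bias numerically. The following sketch assumes that reg_wind, reg_wind_dem and data_theory from above are still in memory:

# Compare the wind coefficient with and without demand in the model.
coef(reg_wind)["wind"] - coef(reg_wind_dem)["wind"]  # negative: downward bias
# The sign predicted by (2.3-2): sign(beta_2) * sign(corr(wind, load)).
sign(coef(reg_wind_dem)["load"]) * sign(cor(data_theory$wind, data_theory$load))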

c) The Model With All Explanatory Variables

We extended our first, very simple regression model by introducing ERCOT demand as an explanatory variable for CO$_2$ emissions. However, Novan also controls for demand in neighbouring grids in his model of the effect of wind turbines in the ERCOT grid on emissions of conventional power plants. The info box "Why Texas?" in exercise 1.2 explained that the ERCOT grid in Texas is fairly isolated, but is connected to the SPP grid in Oklahoma. Therefore ERCOT wind energy might have an effect on SPP as well as ERCOT fossil fuel power plants. That is why Novan incorporates SPP demand in his model. Additionally, he includes the squares and cubes of hourly ERCOT and SPP demand, because emissions might also be nonlinearly influenced by electricity load. The model including all variables looks as follows: $$E_t = \beta_0 + \beta_1 \cdot W_t + \sum_{n=1}^3 (\theta_{1,n} \cdot D_{1,t}^n + \theta_{2,n} \cdot D_{2,t}^n) + \varepsilon_t \qquad (2.3-3)$$ Here $D_1$ denotes ERCOT demand and $D_2$ SPP demand.

Let's perform the regression. The hourly SPP demand is saved in the column spp_load in our dataset. Enter the explanatory variables in this order: wind, load, load2, load3, spp_load, spp_load2, spp_load3.

#< task
# Run the regression of 'tx_co2' on all explanatory variables as specified
# in (2.3-3) and store the result in 'reg_allvar'.
#>

reg_allvar = lm(tx_co2 ~ wind + load + load2 + load3 +
                  spp_load + spp_load2 + spp_load3, data = data_theory)

Now run the next code chunk to create a table showing the results for all three models.

#< task_notest
# These lines produce a table with the regression results.

stargazer(reg_wind, reg_wind_dem, reg_allvar,
          type="html", style="aer",
          keep.stat=c("adj.rsq","n"),
          covariate.labels = c("ERCOT Wind [MWh]", "ERCOT Demand [MWh]",
                               "Squares ERCOT Demand", "Cubes ERCOT Demand",
                               "SPP Demand [MWh]", "Squares SPP Demand",
                               "Cubes SPP Demand"),
          dep.var.labels = "CO2 [tons]",
          model.numbers = FALSE,
          column.labels = c("Simple", "Extended", "All Variables"),
          star.cutoffs = c(0.05, 0.01),
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.",
          notes.append = FALSE)
#>

The results show that this model does, according to the values for the adjusted $R^2$, indeed have slightly more explanatory power than the model in equation (2.3-1). You can also see that adding the additional demand variables made the estimated impact of wind energy on CO$_2$ emissions go down further in absolute value: we now have $\beta_1=-0.578\,\text{tons/MWh}$. Note that the value for SPP demand in the last hour of 2011 is missing in our dataset, which is why the regression with all explanatory variables uses one observation less than the other two.

The model now includes all explanatory variables that Novan uses to estimate the impact of wind generation on emissions. See the info box below for a few additional remarks. Remember again that we will use the same independent variables for all three pollutants.

In the next exercise we will learn how to deal with seasonality and autocorrelation in our data.

< info "A Few More Notes on the Model"

Two things to note before we move on to the next exercise. First, including the squares and cubes of demand in our model does not contradict the linearity assumption in the CLR model. As Kennedy (2008, chapter 6.3) points out, such an equation, while not linear in variables, is still linear in parameters and thus fulfils the first assumption in the CLR model. You can think about this as defining, for example, a new explanatory variable $z=x^2$.
Secondly, note that we specified a static model, which means that we only use contemporaneous variables and no lagged variables from past time periods (see Wooldridge (2013, chapter 10.2)).
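Regarding the first note, here is a toy sketch with hypothetical data: defining $z=x^2$ by hand and letting the formula do it via I() produce exactly the same coefficients.

# Hypothetical data for illustration only.
set.seed(1)
x = runif(100)
y = 1 + 2 * x + 3 * x^2 + rnorm(100)
z = x^2
# Both regressions are linear in parameters and yield identical estimates.
coef(lm(y ~ x + z))
coef(lm(y ~ x + I(x^2)))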

>

Exercise 2.4 -- Building the Model: Seasonality

One phenomenon which is often encountered when dealing with time series data is seasonality. It occurs when variables experience periodic fluctuations, for example throughout each year. Seasonality can be accounted for in a regression model by including fixed effects (Wooldridge (2013, chapter 10.5)). These can be modelled through a set of dummy variables which have the value $1$ if the criterion described through that dummy variable is applicable in time period $t$ and $0$ otherwise. Say, for example, you want to model how many drinks are sold at a bar. It is likely that more people will go out on Fridays and Saturdays because they do not have to work the next day (and maybe they are also inclined to have more drinks on these days than the rest of the week). This could be taken into account by including weekday dummies in the model. One category (in our example this could be Monday) serves as the base category and a dummy variable is included for each of the other categories.
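To make this concrete, here is a toy sketch with made-up drink sales: R expands a factor variable into exactly such dummy variables, using the first level as the base category.

# Hypothetical bar data for one week.
bar = data.frame(
  drinks = c(50, 48, 55, 60, 120, 150, 70),
  weekday = factor(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                   levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# 'model.matrix()' shows the dummies lm() would use; Monday is the base.
model.matrix(~ weekday, data = bar)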

Autocorrelation in the residuals of a regression is especially problematic, be it from seasonal effects or another source. In that case the third assumption for the CLR model is violated and the standard errors of the estimates are biased. Let us see whether we can detect any possible seasonality by plotting the autocorrelation function (ACF) of the residuals from the linear regression model in (2.3-3). If you are not familiar with the ACF or reading ACF plots, check the info box below. Run the next two code chunks to load the dataset and perform the regression again.

#< task
# Load 'data_theory.rds' and store it in `data_theory`.

data_theory = readRDS("data_theory.rds")
#>
#< task
# Run the regression of 'tx_co2' on all variables as specified in (2.3-3)
# and store the result in reg_allvar.

reg_allvar = lm(tx_co2 ~ wind + load + load2 + load3 +
                  spp_load + spp_load2 + spp_load3, data = data_theory)
#>

< info "The Autocorrelation Function"

The autocorrelation function of a time series computes the correlation of an observation at time $t$ with those of preceding points in time $t-l$, where $l$ is the lag (Brockwell & Davis (2002)). It can be visualised with an autocorrelation plot. Here the lag is on the x-axis and the autocorrelation is on the y-axis. For each lag, a vertical line shows the strength of the correlation. At $l=0$, the autocorrelation is always one because it describes the correlation of a value with itself. Two horizontal lines above and below the x-axis indicate the significance level.
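As a quick, purely illustrative example, the ACF of simulated white noise shows no significant autocorrelation beyond lag $0$; nearly all lines stay within the two significance bounds:

# Simulated white noise for illustration only.
set.seed(1)
acf(rnorm(1000), lag.max = 50)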

>

Now we need to obtain the residuals $\hat{\varepsilon}_t$ from the regression and compute their autocorrelation function. This can be achieved by using the function resid() to extract the residuals and applying the function acf() on them. acf() computes and plots the ACF. The argument lag.max = ... allows you to set the maximum lag for which the ACF is computed. Use only one line of code in the code chunk below by nesting the two functions. See the info box below for an example. The maximum lag for the autocorrelation function should be set to $100$.

< info "Nesting Functions: Example"

Suppose you want to take the square root of a number's absolute value. You could do that by first calculating the absolute value of the number with abs(), save the result in a variable and then apply sqrt() on this variable:

x = -9
y = abs(x)
y
sqrt(y)

A shorter way is to combine the two commands without saving any intermediate results:

sqrt(abs(x))

>

#< task
# Compute the autocorrelation function for the residuals of 'reg_allvar'.
# Set the maximum lag to 100.
#>
acf(resid(reg_allvar), lag.max=100)

< award "Nesting Correlator"

Well done, you nested two functions and computed the autocorrelation function of the residuals from our regression!

>

The ACF-plot shows an oscillating behaviour with high levels of autocorrelation at lags of multiples of $24$. This indicates not only that the residual of one hour is correlated with the residuals of the preceding hours, but also that it has the highest correlations with residuals from the same hours of the preceding days. The effect, however, decreases with increasing time lag between two residuals.

Novan finds seasonality in emissions across seasons, times of day and days of the week. He includes hourly fixed effects for all $60$ months in the sample. This means that he assumes that the same hour on all days within one month exhibits the same (mean) level of wind. Therefore the author includes dummy variables for each combination of hour, month and year. Novan argues that, if the output from wind turbines was only determined by the available wind, these hourly fixed effects would suffice to model the seasonality. Due to curtailments, however, he suspects patterns determined by the day of the week, too. He finds, for example, significantly higher average hourly wind generation during the last hour of Sundays and the first hour of Mondays than during the last hour of Thursdays and the first hour of Fridays even though the available wind should not differ systematically in these periods (see page 301 in Novan's paper). This circumstance, too, can be seen in an ACF plot when we show higher lags. Once more, compute the ACF of our residuals, but this time with a maximum lag of $500$.

#< task
# Determine the autocorrelation function for residuals of 'reg_allvar'.
# Set the maximum lag to 500.
#>
acf(resid(reg_allvar), lag.max=500)

You can see that, after the amplitude of the oscillation decreases for the first few days, it increases again until it peaks at a lag of seven days. This weekly oscillation on top of the daily one persists even after several weeks. Novan accounts for this by also allowing the hourly fixed effects to vary across days of the week. This means that there are fixed effects for each combination of hour, month, year and day of the week. This makes $24 \times 12 \times 5 \times 7=10080$ dummy variables. However, there has to be one base category for each combination of day of the week, month and year. These are dropped, making a total of $10080-7 \times 12 \times 5=9660$ dummy variables.
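You can quickly check this count in R:

# Hour-weekday-month-year combinations ...
24 * 12 * 5 * 7
# ... minus one base category for each weekday-month-year combination.
24 * 12 * 5 * 7 - 7 * 12 * 5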

In addition to the fixed effects for seasonality, Novan also includes fixed effects for each of the $1826$ days in the sample. This is to account for unobserved fixed effects like, for example, the shut-down of one power plant for maintenance, which may result in another (cleaner or dirtier) power plant ramping up production for compensation. Again, one dummy variable is dropped.

The equation for the model including the fixed effects looks as follows (see also the info box below): $$E_t = \beta_0 + \beta_1 \cdot W_t + \sum_{n=1}^3 (\theta_{1,n} \cdot D_{1,t}^n + \theta_{2,n} \cdot D_{2,t}^n) + \alpha_{h,m,y,w} + \delta_d + \varepsilon_t \qquad (2.4-1)$$ $\alpha_{h,m,y,w}$ denotes the hourly fixed effect varying across month, year and day of the week and $\delta_d$ indicates the daily fixed effect.

< info "A Note on the Model Equation"

Please note that equation (2.4-1) abbreviates the dummy terms by only stating the general dummy coefficients. The complete form of $\alpha_{h,m,y,w} + \delta_d$ could look as follows: $$\sum_{h=1}^{23} \sum_{m=1}^{12} \sum_{y=2007}^{2011} \sum_{w=Su}^{Sa} \alpha_{h,m,y,w} P_{h,m,y,w} + \sum_{d=1}^{1825} \delta_d Q_d$$ Here $P_{h,m,y,w}$ stands for the hourly dummy variable and $Q_d$ stands for the daily dummy variable. $\alpha_{h,m,y,w}$ and $\delta_d$ are their respective coefficients. Each dummy variable takes the value $1$ if the observation fulfils its requirements and $0$ otherwise. For example, $Q_1$ is $1$ for observations on the first day in the sample and $0$ for all other observations.

>

We now arrived at the full specification that Novan uses to estimate the average emissions offset by wind generation. However, the author does not incorporate the fixed effects directly into the estimation as the formula above suggests. Instead, he deseasonalises both the explained variable as well as the explanatory variables. In order to deseasonalise a variable, the variable is regressed on all dummy variables (Wooldridge (2013, chapter 10.5)). The residuals $\hat{\varepsilon}_t$ from this regression constitute the new, deseasonalised variable. This is done for each variable in our model, which is then estimated using the deseasonalised variables. We will not deseasonalise all variables ourselves in this problem set because there are quite a few variables. But we will go through the procedure for the variable wind as an example.
First, a new data frame data_fe is created, which contains all variables needed for the computations. This is done using functions from the packages dplyr and lubridate. See the comments in the chunk below to learn how this is done. Just press check to run it.

#< task
# First, 'dplyr' and 'lubridate' are loaded.
library(dplyr)
library(lubridate)

# Then 'data_fe' is created using the pipe operator '%>%'. You already
# encountered it in exercise 1.2. The data for 'data_fe' comes from the
# dataset 'data_theory'.
data_fe = data_theory %>%

  # First, the variables 'datetime', 'hour' and 'wind' are selected.
  select(datetime, hour, wind) %>%

  # Then two new columns are added using 'mutate()'.
  # 'day_fe' contains the categories for the daily fixed effect. The date of
  # each observation is extracted from 'datetime' using the 'lubridate'-function
  # 'date()' and then turned into a factor variable. This makes R treat the
  # elements in this variable as categories.
  # 'hour_fe' is the variable for the hourly fixed effect. Each observation is
  # assigned a string consisting of the hour, the day in the week (denoted by
  # numbers from 1 for Sunday to 7 for Saturday), the month and the year.
  # The latter three are extracted from 'datetime' using functions from
  # 'lubridate'. The function 'paste()' is used to combine the four parts,
  # separating them with '-'. That way all observations that have the same
  # hour, weekday, month and year are assigned the same category, which is
  # exactly what we need for the hourly fixed effect. The result is again
  # turned into a factor variable. 
  mutate(day_fe=factor(date(datetime)),
         hour_fe=factor(paste(hour, wday(datetime), month(datetime),
                              year(datetime), sep="-")))

# And finally, the first few lines of the new data frame are displayed.
head(data_fe)
#>

The new data frame includes the data on hourly ERCOT wind generation, which we want to deseasonalise, as well as the variables for the daily and hourly fixed effects. In the next code chunk, the variable is deseasonalised. wind is regressed on day_fe and hour_fe using the function felm() from the package lfe. You can read more on the function and how to use it in the info box below. The residuals are extracted and saved as wind_des. You just have to run the next code chunk.

#< task
# 'lfe' has to be loaded first.
library(lfe)

# Then 'wind' is regressed on the dummy variables 'day_fe' and 'hour_fe'.
# The residuals from the regression are extracted and saved as 'wind_des'.
# This is the deseasonalised variable for hourly wind generation.
wind_des = resid(felm(wind ~ 0 | day_fe + hour_fe, data=data_fe))

# Finally, part of 'wind_des' is displayed.
head(wind_des)
#>

< info "Estimating Linear Models with Fixed Effects Using 'felm()'"

The function felm() from the package lfe allows you to estimate linear regression models with fixed effects. The syntax is similar to the one for lm():
felm(dependentvariable ~ explanatoryvariables | dummyvariables , data = ...)
The right-hand side of the formula now has an additional part, in which the dummy variables are specified. Since we only have dummy variables in our application, the part for the explanatory variables only contains 0.
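As a small sketch with hypothetical data (assuming lfe is loaded), you can convince yourself that felm() with a factor in the dummy part yields the same residuals as lm() with the factor as an ordinary regressor:

# Toy data: 100 observations in 10 groups.
set.seed(1)
df = data.frame(y = rnorm(100), g = factor(rep(1:10, each = 10)))
# The residuals agree (felm() returns them as a one-column matrix).
all.equal(as.vector(resid(felm(y ~ 0 | g, data = df))),
          unname(resid(lm(y ~ g, data = df))))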

>

wind_des is the new, deseasonalised variable for the hourly wind generation. Performing the regression and replacing the original variable with its residuals $\hat{\varepsilon}_t$ has removed the part of the fluctuation that can be explained by the seasonal effects. Our estimations for the effect of wind generation on emissions will be based on the residual variation in both the dependent and the explanatory variables.

All deseasonalised variables are provided in a new dataset which we will use from now on. They are taken from Novan's original dataset and are labelled by adding _tr at the end of the variable name (e.g. tx_co2 will now be called tx_co2_tr). The new dataset is called data.rds. Load the dataset and save it as data. Then show ten random rows.

#< task
# Load the dataset and show ten rows.
#>
data = readRDS("data.rds")
sample_n(data, 10)

You can see that data is similar to data_theory, but the deseasonalised variables tx_co2_tr, wind_tr, load_tr, load2_tr, load3_tr, spp_load_tr, spp_load2_tr and spp_load3_tr have replaced the original variables. In addition to the CO$_2$ emissions, the dataset now contains hourly data for NO$_x$ and SO$_2$ (both in $\text{lbs}$) as well. These are stored in the columns tx_nox_tr and tx_so2_tr, respectively, and are provided by the EPA (cf. section II in Novan's paper). The dataset also includes a number of additional variables, which we will use in the course of this problem set. They will be explained later on.

Let's use this dataset for regressing the emissions of CO$_2$ on all explanatory variables, i.e. ERCOT wind production and all demand variables, taking into account the seasonal effects. The code is already provided in the code chunk below. Just press check.

#< task
# CO2 emissions are regressed on all explanatory variables according to the
# model in (2.4-1) using the deseasonalised variables from 'data'.
# The regression results are saved as 'em_co2_tx'.

em_co2_tx = lm(tx_co2_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
                  spp_load_tr+spp_load2_tr+spp_load3_tr, data = data)
#>

Run this code chunk to produce a table with the regression results.

#< task_notest
# These lines load the package 'stargazer' and produce a table with the
# regression results.
library(stargazer)

stargazer(em_co2_tx,
          type="html", style="aer",
          keep.stat=c("adj.rsq","n"),
          covariate.labels = c("ERCOT Wind [MWh]", "ERCOT Demand [MWh]",
                               "Squares ERCOT Demand", "Cubes ERCOT Demand",
                               "SPP Demand [MWh]", "Squares SPP Demand",
                               "Cubes SPP Demand"),
          dep.var.labels = "CO2 [tons]",
          star.cutoffs = c(0.05, 0.01),
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.",
          notes.append = FALSE)
#>

The regression with deseasonalised data yields $\hat{\beta}_1=-0.628\,\text{tons/MWh}$. The estimated impact of wind generation on CO$_2$ emissions increased by about $9\%$ compared to the estimate from the regression with the original data.

Before turning to the next exercise in our theory part, we should take a look at how deseasonalising the variables has affected the autocorrelation function of the residuals.

#< task
# Plot the autocorrelation function for the residuals of the regression
# `em_co2_tx` with maximum lag 100.

acf(resid(em_co2_tx), lag.max=100)
#>

We see that the autocorrelation is considerably lower than before. However, deseasonalising the data does not remove it completely.
To explore further what kind of autocorrelation is still present in the residuals, let's compute the partial autocorrelation function (PACF). It describes the autocorrelation between two observations in a time series after removing the effects of all other observations between the two points in time (Brockwell & Davis (2002)). Run the next code chunk to show the PACF.

#< task
# Compute the PACF for the residuals of 'em_co2_tx'.
pacf(resid(em_co2_tx), lag.max=100)
#>

What is left in the residuals of em_co2_tx is mainly the first-order autocorrelation, while the higher orders are fairly small. It makes sense that the observations are highly correlated with the preceding observations and also that removing seasonal effects cannot get rid of this correlation.
While there are other methods to handle autocorrelation, we will follow Novan's approach and not apply those here. The main problem with autocorrelated residuals is that this affects the standard errors in our regression. We will learn in the next exercise how these can be treated to make them robust despite the remaining autocorrelation in the residuals.

Exercise 2.5 -- Building the Model: Autocorrelation and Newey-West Standard Errors

So far we only interpreted the coefficient estimates from our regressions. Now let's find out a bit more about the quality of the results. For that, we will take a look at the standard errors of our estimator $\hat{\beta}$.

a) Standard Errors

The standard error is a measure for how precise an estimator is. It is also the basis for test statistics of regressions (we will come back to those later on in this exercise). In order to calculate the standard error of an OLS estimator, we need its variance-covariance matrix. In the case that all assumptions for the CLR model are met, it is defined as (Kennedy (2008, chapter 3)): $$Var(\hat{\beta})=\sigma^2(X'X)^{-1}$$ Unfortunately, we usually do not know $\sigma^2$, the variance of the error term. Instead, we have to resort to the estimator $$\hat{\sigma}^2=\frac{1}{T-K} \sum_{t=1}^T \hat{\varepsilon}_t^2$$ where $T-K$ is the number of degrees of freedom. We can use this to devise a formula for the estimator of the variance-covariance matrix of $\hat{\beta}$: $$\hat{Var}(\hat{\beta})=\hat{\sigma}^2(X'X)^{-1}$$ The standard error for the estimator $\hat{\beta}_k$, which we denote $se(\hat{\beta}_k)$, is defined as the square root of the $k$'th diagonal element of the estimated variance-covariance matrix (Greene (2012, chapter 4.3.7)): $$se(\hat{\beta}_k)=\sqrt{[\hat{Var}(\hat{\beta})]_{kk}}$$
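To connect these formulas to R's output, here is a sketch that computes the ordinary standard errors by hand; it assumes a fitted model such as em_co2_tx from exercise 2.4 is in memory:

# Design matrix, residuals and estimated error variance.
X = model.matrix(em_co2_tx)
eps_hat = resid(em_co2_tx)
sigma2_hat = sum(eps_hat^2) / (nrow(X) - ncol(X))
# Square roots of the diagonal of the estimated variance-covariance matrix;
# these should match the standard errors shown by summary(em_co2_tx).
sqrt(diag(sigma2_hat * solve(crossprod(X))))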

We learned in the last exercise how to remove seasonal effects from our data in order to improve our regression model. Unfortunately, we are still left with autocorrelation in the error terms, as we could see in the last ACF plot and from the PACF. This is a problem in that the standard errors for the estimators $\hat{\beta}_k$, as defined above, are not applicable: they are biased (Wooldridge (2013, chapter 12.1)). Luckily, we can still use the estimators themselves, but have to treat their standard errors separately before we can draw inferences from them.

b) The Newey-West Estimator

In order to get robust standard errors, we have to use a different estimator for the variance-covariance matrix. Novan uses an estimator proposed by Newey & West (1987), which is both heteroskedasticity- and autocorrelation-consistent (HAC): $$\hat{Q}=S_0+\frac{1}{T} \sum_{l=1}^L \sum_{t=l+1}^T w_l \hat{\varepsilon}_t \hat{\varepsilon}_{t-l}(x_t x'_{t-l} + x_{t-l} x_t')$$ where $$S_0=\frac{1}{T} \sum_{t=1}^T \hat{\varepsilon}_t^2 x_t x_t'$$ Note that the equation closely follows the notation of Greene (2012, chapter 20.5.2) in order to adhere to the notation we have used so far. $L$ is the lag up to which the disturbance term is assumed to be autocorrelated and the factor $w_l$ is a set of weights, defined as $$w_l = 1-\frac{l}{L+1}$$ $\hat{Q}$ is called the Newey-West (NW) estimator. Crucial to the calculation of $\hat{Q}$ is the choice of $L$. Unfortunately, the form of the autocorrelation is often unknown. Newey & West (1987) require $L$ to depend on the sample size $T$, but the exact form of the relation remains unclear and the suggestions in the literature vary. According to Greene (2012, chapter 20.5.2), the usual choice is $L \approx T^{1/4}$. Stock & Watson (2007, chapter 15.4) point out that one can also incorporate one's knowledge about the amount of autocorrelation in the time series to decide on a value for $L$.
The HAC standard error for $\beta_k$ is, as before, the square root of the $k$'th diagonal element of the estimated variance-covariance matrix, which is now the NW estimator. The Newey-West standard error $se_{NW}(\hat{\beta}_k)$ is then: $$se_{NW}(\hat{\beta}_k)=\sqrt{[\hat{Q}]_{kk}}$$

c) Adjusting the Standard Errors

The function lm(), which we use for estimating linear regressions in R, computes the ordinary standard errors and does not correct the variance-covariance matrix for autocorrelation and heteroskedasticity in the error terms. But we do not have to compute the NW estimator ourselves. This can be done using the function NeweyWest() from the package sandwich (for more information, see the info box below). It uses the output from a regression to compute the NW estimator, which can then be used to calculate the adjusted standard errors.
Let's apply this to our regression for CO$_2$. The next code chunk reads in the dataset data.rds again and performs the regression of tx_co2_tr on all deseasonalised explanatory variables, as specified in (2.4-1). Just click edit and then check.

#< task
# Load `data.rds` and perform the regression for the CO2 emissions in Texas.
data = readRDS("data.rds")
em_co2_tx = lm(tx_co2_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
                  spp_load_tr+spp_load2_tr+spp_load3_tr, data = data)
#>

< info "Estimating Robust Variance-Covariance Matrix Estimators with 'sandwich'"

The package sandwich comprises functions for computing different robust estimators of the variance-covariance matrix in regressions. They span various applications like models with autocorrelation, heteroskedasticity or both. We are using the function NeweyWest(), which computes the variance-covariance matrix as described in Newey & West (1987).

>

Now load the package sandwich and compute the NW estimator of the variance-covariance matrix using NeweyWest(). The input for the function is em_co2_tx. We also have to specify the number of lags, $L$, which is done with the argument lag = .... Novan uses $L=24$. Also, the argument prewhite = ... has to be set to FALSE in order to follow Novan's approach. Save the results as NW_co2_tx.

#< task_notest
# Load 'sandwich' and compute the NW estimator for the regression 'em_co2_tx'.
# Save the result as 'NW_co2_tx'.
#>

library(sandwich)
NW_co2_tx = NeweyWest(em_co2_tx, lag=24, prewhite=FALSE)

< award "Newey-West Estimator"

Well done, you estimated the HAC Newey-West estimator using NeweyWest()!

>

Run the next code chunk to show the NW estimator.

#< task
# Show the NW variance-covariance matrix.

print(NW_co2_tx)
#>

The variance of the $\beta_k$ is on the diagonal while the rest of the matrix is filled with the covariances. Now we just have to take the square root of each diagonal element of NW_co2_tx in order to get the HAC standard errors. The diagonal elements of a matrix can be extracted using the function diag(). Applying sqrt() on the result takes the square root of all elements individually. Do the computation in one command without saving any intermediate results. Save the end result as se_co2_tx.

#< task
# Compute the NW standard errors.
#>

se_co2_tx = sqrt(diag(NW_co2_tx))

The code in the next chunk creates a table which shows the results from the regression em_co2_tx, once with the ordinary standard errors and once with the NW standard errors. The table does not look very neat because it displays more digits than usual. But this is necessary to show the difference between the two types of standard errors.

#< task_notest
# Create a table showing the results from 'em_co2_tx' first with the ordinary
# standard errors and then with NW standard errors.

stargazer(em_co2_tx, em_co2_tx,
          type="html", style="aer",
          se = list(NULL, se_co2_tx),
          digits = NA,
          keep.stat=c("adj.rsq","n"), 
          covariate.labels = c("ERCOT Wind [MWh]", "ERCOT Demand [MWh]",
                               "Squares ERCOT Demand", "Cubes ERCOT Demand",
                               "SPP Demand [MWh]", "Squares SPP Demand",
                               "Cubes SPP Demand"), 
          dep.var.labels = "CO2 [tons]", 
          column.labels = c("Normal SE", "HAC SE"),
          model.numbers = FALSE, 
          star.cutoffs = c(0.05, 0.01), 
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.", 
          notes.append = FALSE)
#>

The table shows how the standard errors changed after replacing the ordinary estimator of the variance-covariance matrix with the NW estimator.

We are not quite finished yet with adjusting the standard errors. When it comes to the degrees of freedom, the function lm() only takes the number of explanatory variables into account. But since we are using deseasonalised variables in the regression, it has fewer degrees of freedom and we have to adjust our standard errors accordingly (cf. Novan's Stata code available at Novan (2015b)). Recall that we have $9660$ hourly and $1825$ daily fixed effects. The number of degrees of freedom is reduced by the number of fixed effects. When computing the standard errors, this has to be taken into account by multiplying each NW standard error by the factor $\sqrt{\frac{T-K-1}{T-K-1825-9660}}$. However, you do not have to do this at this point. In fact, from now on you won't have to compute the adjusted NW standard errors yourself; they will be stated when showing regression results. While it is important for you to know how the NW standard errors are computed and adjusted, it is more educational to focus on other parts of the analysis.
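Just for illustration, applying the correction to our NW standard errors would look like the following sketch (the numbers of fixed effects are those stated above):

# Adjust the NW standard errors for the degrees of freedom absorbed
# by the 9660 hourly and 1825 daily fixed effects.
T = nobs(em_co2_tx)  # number of observations
K = 7                # number of explanatory variables
se_co2_tx * sqrt((T - K - 1) / (T - K - 1825 - 9660))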

d) Significance

We have everything in order to estimate the average CO$_2$ emissions that are offset by wind energy in the same way as Novan has done in his paper: we know the model including all explanatory variables, we adjusted our variables for seasonal effects and the standard errors are robust to heteroskedasticity and autocorrelation in the error terms. We can now attend to the last part in the regression tables, which we ignored so far: significance.

To judge whether or not the estimates for the coefficients in our regression are significant, we are using a hypothesis test. Different hypothesis tests can be used for various assumptions about characteristics of the regression model, but they all work according to the same concept.
First, a null hypothesis $H_0$ and an alternative hypothesis $H_1$ are formulated (Wooldridge (2013, chapter 4.2)). For our application, the following null hypothesis is used: $$H_0:\ \beta_k=0$$ If the null hypothesis is true, the explanatory variable $k$ does not have any effect on the dependent variable and its coefficient $\beta_k$ is zero. The alternative hypothesis is then $$H_1:\ \beta_k \neq 0$$ It states that the dependent variable is influenced by the explanatory variable $k$.
Once the null hypothesis and the alternative hypothesis have been set up, the goal is to decide whether the null hypothesis can be rejected in a specific application. The basis for testing this is always a test statistic. For our null hypothesis, the $t$-distribution is used and the test is therefore referred to as the $t$-test (Wooldridge (2013, chapter 4.2)). The $t$-value for $\hat \beta_k$ is calculated as follows: $$t_{\hat \beta_k}=\frac{\hat \beta_k}{se(\hat \beta_k)}$$ This shows again why it is important to use the Newey-West standard errors instead of the ordinary ones: the test statistic would otherwise be faulty.
The test statistic and its realisation in a specific application do not suffice to decide whether we can reject the null hypothesis. Additionally, the $p$-value is calculated (Wooldridge (2013, chapter 4.2)). It is the probability of obtaining the realised value of the test statistic (or a more extreme value) given that the null hypothesis holds. Of course, we will never get a $p$-value of zero, which would tell us that it is impossible to get such a $t$-value when the null hypothesis is true. Instead, the obtained $p$-value is compared to a chosen significance level $\alpha$. If the $p$-value is below this significance level, we reject the null hypothesis. In this problem set we always use two significance levels in our regressions, just as Novan does in his paper: $1\%$ and $5\%$.
Luckily, R does all the computations for us, so we do not have to compute the $t$- and $p$-values ourselves. The number of stars behind the estimates in a regression table indicate whether the estimated coefficients are significant and at which level. The coefficient $\hat{\beta}_1$ in em_co2_tx, for example, is significant at the $1\%$ level, as shown by the two stars behind its value in the table above.
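As a sketch, the $t$- and $p$-value for the wind coefficient can be reproduced by hand, assuming em_co2_tx and the (unadjusted) NW standard errors se_co2_tx from above are in memory:

# t-value: estimate divided by its (here: NW) standard error.
t_wind = coef(em_co2_tx)["wind_tr"] / se_co2_tx["wind_tr"]
# Two-sided p-value from the t-distribution; note that df.residual() does
# not account for the fixed effects adjustment discussed above.
p_wind = 2 * pt(abs(t_wind), df = df.residual(em_co2_tx), lower.tail = FALSE)
c(t = unname(t_wind), p = unname(p_wind))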

< info "Do We Have Heteroskedasticity in Our Model?"

This is an important question because a heteroskedastic error term violates the third assumption for the CLR model, just like autocorrelation does. In his paper, Novan does not mention whether he is assuming heteroskedasticity in his model. But he uses an estimator for the variance-covariance matrix which is both autocorrelation- and heteroskedasticity-consistent. This suggests that he cannot rule out that the error terms might be heteroskedastic.
Plotting the residuals $\hat{\varepsilon}$ versus the fitted values $\hat y$ indicates homoskedasticity, as you can see below. The residuals seem evenly scattered around their mean for all values of $\hat y$ and there is no noticeable trend in their variance.

# Load the dataset and perform the regression according to (2.4-1).
data = readRDS("data.rds")
em_co2_tx = lm(tx_co2_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
                  spp_load_tr+spp_load2_tr+spp_load3_tr, data = data)

# Retrieve the fitted values from 'em_co2_tx' and save them as 'y_hat'.
y_hat = fitted(em_co2_tx)
# Then retrieve the residuals and save them as 'eps'.
eps_hat = resid(em_co2_tx)
# Finally, plot the fitted values versus the residuals.
plot(y_hat, eps_hat)

We can check the graphical result by regressing $|\hat{\varepsilon}|$ on $\hat y$, as done below.

# This regresses the residuals' absolute values on the fitted values and shows
# summary statistics from the regression.
summary(lm(abs(eps_hat) ~ y_hat))

The regression yields a coefficient of $-0.003$, which is not significant. This suggests that, if the residuals are heteroskedastic at all, the heteroskedasticity is very weak.

>

In this exercise, you learned how to deal with heteroskedasticity and autocorrelation in the disturbance term and how to interpret the significance of regression results (see the info box above for some additional remarks on heteroskedasticity in our model). We know Novan's model for estimating the impact of wind energy on pollution and how to interpret results from a regression. In the next exercise, we will apply this knowledge when estimating the impact of ERCOT wind generation on emissions of CO$_2$, NO$_x$ and SO$_2$ from fossil fuel power plants and discussing the results.

Exercise 2.6 -- Building the Model: Average Emissions Offset by Wind Production

In the course of the last five exercises, we derived step by step the regression model Novan uses in his paper to estimate the impact of wind energy in Texas on the emission of pollutants from fossil fuel power plants. Here again the full model equation as in (2.4-1): $$E_t = \beta_0 + \beta_1 \cdot W_t + \sum_{n=1}^3 (\theta_{1,n} \cdot D_{1,t}^n + \theta_{2,n} \cdot D_{2,t}^n) + \alpha_{h,m,y,w} + \delta_d + \varepsilon_t$$

a) The Impact of Wind Power on Emissions in Texas

So far we only performed the regression for CO$_2$ emitted by power plants in Texas, but of course we also want to estimate the effect on emissions of NO$_x$ and SO$_2$.

Run the next two code chunks to load data.rds once more and perform the regressions for all three pollutants. The explanatory variables are the same, only the dependent variable changes. Recall that the deseasonalised aggregate hourly NO$_x$ and SO$_2$ emissions from ERCOT fossil fuel-fired generators are stored as tx_nox_tr and tx_so2_tr, respectively. The results are saved in em_co2_tx, em_nox_tx and em_so2_tx.

#< task
# Load 'data.rds' and save it as 'data'.

data = readRDS("data.rds")
#>
#< task
# Perform the regression specified in (2.4-1) for each of the three pollutants.

em_co2_tx = lm(tx_co2_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
               spp_load_tr+spp_load2_tr+spp_load3_tr, data)

em_nox_tx = lm(tx_nox_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
               spp_load_tr+spp_load2_tr+spp_load3_tr, data)

em_so2_tx = lm(tx_so2_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
               spp_load_tr+spp_load2_tr+spp_load3_tr, data)
#>

Run the next chunk to compute the Newey-West standard errors and produce a table with the results from all three regressions. The table will only show the coefficients for wind generation since these are the ones we are interested in. The other explanatory variables are merely controls and will not be presented in detail.

#< task_notest
# Compute the degrees of freedom-adjusted NW standard errors.
library(sandwich)
se_co2_tx = sqrt(diag(NeweyWest(em_co2_tx, lag=24, prewhite=FALSE)) *
                   (43794-7-1)/(43794-7-1825-9660))
se_nox_tx = sqrt(diag(NeweyWest(em_nox_tx, lag=24, prewhite=FALSE)) *
                   (43794-7-1)/(43794-7-1825-9660))
se_so2_tx = sqrt(diag(NeweyWest(em_so2_tx, lag=24, prewhite=FALSE)) *
                   (43794-7-1)/(43794-7-1825-9660))

# Load stargazer and create the table with regression results.
library(stargazer)
stargazer(em_co2_tx, em_nox_tx, em_so2_tx,
          type="html", style="aer",
          title = "Estimated Impact of ERCOT Wind Energy on Emissions from ERCOT Power Plants",
          se = list(se_co2_tx, se_nox_tx, se_so2_tx),
          keep = "wind_tr",
          add.lines = list(c("ERCOT Demand", "Yes", "Yes", "Yes"),
                           c("SPP Demand", "Yes", "Yes", "Yes"),
                           c("Hourly FE", "Yes" ,"Yes", "Yes"),
                           c("Constant", "Yes", "Yes", "Yes")),
          keep.stat=c("adj.rsq","n"), 
          covariate.labels = c("ERCOT Wind [MWh]"),
          dep.var.labels = c("CO2 <br> [tons]", "NOx <br> [lbs]", "SO2 <br> [lbs]"),
          model.numbers = FALSE,
          star.cutoffs = c(0.05, 0.01), 
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.", 
          notes.append = FALSE)
#>

< quiz "Interpreting Regression Coefficients"

question: What does the estimated coefficient for the effect of Texan wind generation on NOx emissions mean?

sc:
- When wind generation increases by 1%, additional 0.900% of NOx emissions are avoided on average.
- When wind generation increases by 1%, additional 0.900 lbs of NOx emissions are avoided on average.
- When wind generation increases by 1 MWh, additional 0.900% of NOx emissions are avoided on average.
- When wind generation increases by 1 MWh, additional 0.900 lbs of NOx emissions are avoided on average.*

success: Great, your answer is correct.
failure: Your answer is not correct. Take another look at the model above.

>


The regression yields that, on average, one $\text{MWh}$ of ERCOT wind generation offsets $0.628\,\text{tons}$ of CO$_2$, $0.900\,\text{lbs}$ of NO$_x$ and $1.774\,\text{lbs}$ of SO$_2$ from ERCOT fossil fuel-fired generators. Compared to the average hourly emissions over the whole observation period, this corresponds to about $0.002\%$ for CO$_2$, $0.003\%$ for NO$_x$ and $0.002\%$ for SO$_2$. All results are significant at the $1\%$ level.

b) The Impact of Wind Power on Emissions in Texas and Oklahoma

The above results are useful in that we can estimate the effect of wind energy on pollution. But so far we only considered the impact on power plants that are directly connected to the ERCOT grid. We already include the demand in the SPP region in our model since the ERCOT and SPP grid are connected. So it seems plausible that wind generation should affect the production of SPP fossil fuel power plants as well. That is why Novan estimates the model in (2.4-1) not only for ERCOT emissions, but also for ERCOT and SPP emissions combined. We will do the same now to see whether wind generation in Texas influences emissions of power plants outside of the ERCOT grid as well.

To examine the effect of Texan wind energy on emissions from power plants both in Texas and Oklahoma, we only have to change the dependent variable in our regression model. In our dataset, the deseasonalised aggregate hourly CO$_2$, NO$_x$ and SO$_2$ emissions from ERCOT and SPP fossil fuel-fired generators are stored as txok_co2_tr, txok_nox_tr and txok_so2_tr, respectively. The data on SPP emissions are provided, like the ERCOT emissions, by the EPA (cf. section II in Novan's paper). Run the next code chunk to perform the regressions.

#< task
# Perform the regressions for the aggregate emission of CO2, NOx and SO2
# from both ERCOT and SPP power plants. The results are saved as 'em_co2_txok',
# 'em_nox_txok' and 'em_so2_txok'.

em_co2_txok = lm(txok_co2_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
               spp_load_tr+spp_load2_tr+spp_load3_tr, data)

em_nox_txok = lm(txok_nox_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
               spp_load_tr+spp_load2_tr+spp_load3_tr, data)

em_so2_txok = lm(txok_so2_tr ~ wind_tr+load_tr+load2_tr+load3_tr+
               spp_load_tr+spp_load2_tr+spp_load3_tr, data)
#>

The following code chunk computes the adjusted NW standard errors and produces a table showing the results from all six regressions. Just press check.

#< task_notest
# Compute the degrees of freedom-adjusted NW standard errors.
se_co2_txok = sqrt(diag(NeweyWest(em_co2_txok, lag=24, prewhite=FALSE)) *
                   (43794-7-1)/(43794-7-1825-9660))
se_nox_txok = sqrt(diag(NeweyWest(em_nox_txok, lag=24, prewhite=FALSE)) *
                   (43794-7-1)/(43794-7-1825-9660))
se_so2_txok = sqrt(diag(NeweyWest(em_so2_txok, lag=24, prewhite=FALSE)) *
                   (43794-7-1)/(43794-7-1825-9660))

# Create a table with the results from all six regressions.
stargazer(em_co2_tx, em_nox_tx, em_so2_tx,
          em_co2_txok, em_nox_txok, em_so2_txok,
          type="html", style="aer",
          title = "Estimated Impact of ERCOT Wind Energy on Emissions <br> (1) - (3): from ERCOT Power Plants <br> (4) - (6): from ERCOT + SPP Power Plants",
          se = list(se_co2_tx, se_nox_tx, se_so2_tx,
                    se_co2_txok, se_nox_txok, se_so2_txok),
          keep = "wind_tr",
          add.lines = list(c("ERCOT Demand", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
                           c("SPP Demand", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
                           c("Hourly FE", "Yes" ,"Yes", "Yes", "Yes", "Yes", "Yes"),
                           c("Constant", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes")),
          keep.stat=c("adj.rsq","n"), 
          covariate.labels = c("ERCOT Wind [MWh]"),
          dep.var.labels = c("CO2 <br> [tons]", "NOx <br> [lbs]", "SO2 <br> [lbs]", "CO2 <br> [tons]", "NOx <br> [lbs]", "SO2 <br> [lbs]"),
          star.cutoffs = c(0.05, 0.01), 
          notes = "**Significant at the 1% level. <br> *Significant at the 5% level.", 
          notes.append = FALSE)
#>

The regressions yield larger values of $\hat{\beta}_1$ in absolute terms for all three pollutants: one $\text{MWh}$ of ERCOT wind energy is estimated to reduce emissions from power plants in the ERCOT and SPP grid by $0.673\,\text{tons}$ for CO$_2$, $1.053\,\text{lbs}$ for NO$_x$ and $1.824\,\text{lbs}$ for SO$_2$. The results are significant at the $1\%$ level. Checking whether the confidence intervals of each pair of estimates overlap reveals that the increased effect for CO$_2$ and NO$_x$ is significant, but the one for SO$_2$ is not. This means that the estimated average emissions of CO$_2$ and NO$_x$ offset by wind energy are significantly higher when considering emissions from power plants both in the ERCOT grid in Texas and the connected SPP grid in Oklahoma. This suggests that wind energy from Texas reduces output and emissions not only from fossil fuel power plants in Texas, but also from those connected to the grid in Oklahoma.

c) Are All Affected Power Plants Considered in the Model?

Before moving on to the next exercise, we should ask ourselves whether the regression includes all fossil fuel units that are substitutes for ERCOT wind turbines and therefore reduce their output in response to wind production. In section III.C of his paper, Novan estimates the average output from non-wind generating units that is offset by the output of wind energy in Texas. He exploits the fact that the electricity supplied to a grid always has to equal demand to ensure grid stability: one $\text{MWh}$ of wind energy should offset an equal amount of non-wind electricity. If the regressions yield that less than one $\text{MWh}$ is avoided, then generating units which reduce their output in response to wind energy are missing from the dataset. This would imply that our estimates for the offset emission of pollutants underestimate the true effect, i.e. that they are biased downward in absolute value. If the results suggest that more than one $\text{MWh}$ is avoided, this could mean that fossil fuel power plants also reduce their output in response to wind generation outside of the ERCOT grid, which we cannot directly control for with the available data. This would result in our computations overestimating the effect of ERCOT wind generation on pollution.
Novan uses the same model as in (2.4-1), only replacing the emissions with the generation by fuel type. He finds that, on average, wind energy mostly replaces electricity from fossil fuel-fired generators and offsets only slightly more than one $\text{MWh}$ of total production. He concludes that all relevant units are included in the dataset and that the estimates for offset emissions are not biased upwards.
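To make this check concrete, here is a hedged sketch of such a regression. gen_fossil_tr is a hypothetical name for a deseasonalised total fossil generation variable (the actual variable names depend on Novan's dataset), so the lines are not meant to be run here:

# Hypothetical check: regress total (deseasonalised) non-wind generation on
# the regressors from model (2.4-1). 'gen_fossil_tr' is a made-up name.
# gen_offset = lm(gen_fossil_tr ~ wind_tr + load_tr + load2_tr + load3_tr +
#                   spp_load_tr + spp_load2_tr + spp_load3_tr, data)
# coef(gen_offset)["wind_tr"]  # close to -1 if all relevant units are included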

In this exercise we estimated the average impact of one $\text{MWh}$ of ERCOT wind power on emissions from conventional power plants. In the next exercise we will study the impact of the marginal $\text{MWh}$ of wind generation on pollution.

Exercise 3.1 -- Marginal Emissions Avoided: The Model

So far we only considered the average effect of one $\text{MWh}$ of electricity from wind turbines on emissions from conventional power plants. But the influence of wind energy is not the same in every situation.


< quiz "Heterogeneity"

question: Which factors are likely to influence how much pollution is offset by one MWh of wind energy? Multiple answers are correct.
mc:
- Level of demand*
- Level of output from renewable energies*
- Market*
success: Indeed, all three answers are correct! We will show in the following how the amount of pollution avoided in response to one more MWh of wind energy depends on the levels of demand and wind generation. Later on, in exercise 6.2, we will see that the pollution reduction depends on the type of renewable technology as well. Furthermore, other papers found that marginal emissions also depend on the market. We will continue to focus on the Texan market, but come back to this point in exercise 7. For now, let's investigate the heterogeneity of wind energy within the ERCOT market.
failure: Your answer is not correct. Try again.

>


a) Modelling the Marginal Emissions Avoided

In order to analyse the marginal effect of wind energy on pollution, Novan estimates a model which incorporates the interaction between wind generation and demand in Texas:

$$E_t = \beta_0 + \sum_{i=1}^3 \sum_{j=0}^3 \beta_{i,j} W_t^i D_{1,t}^j + \sum_{n=1}^3 (\theta_{1,n} D_{1,t}^n + \theta_{2,n} D_{2,t}^n) + \alpha_{h,m,y,w} + \delta_d + \varepsilon_t \qquad (3.1-1)$$

where, as before:

- $E$ is CO$_2$, NO$_x$ or SO$_2$ emissions,
- $W$ is ERCOT wind production,
- $D_1$ is ERCOT demand,
- $D_2$ is SPP demand,
- $\alpha_{h,m,y,w}$ represents the hourly fixed effect and
- $\delta_d$ is the daily fixed effect.

The model looks similar to the one we estimated in the previous exercises, but this time it contains terms not only for wind production, but also for wind production raised to the second and third power, as well as interaction terms of wind production raised to the $i$'th power ($i \in \{1,2,3\}$) and ERCOT demand raised to the $j$'th power ($j \in \{0,1,2,3\}$). The fixed effects are again contained in the deseasonalised variables and we perform one regression for each of the three pollutants. Novan uses the aggregate hourly CO$_2$, NO$_x$ and SO$_2$ emissions from both ERCOT and SPP fossil fuel-fired generators as dependent variables.

Taking the derivative of the emissions $E_t$ with respect to wind output $W$ in equation (3.1-1) and reversing the sign yields the marginal emissions avoided (MEA):

$$MEA(W,D)= \frac{-\partial E(W,D)}{\partial W}=-\sum_{i=1}^3 \sum_{j=0}^3 i\cdot \beta_{i,j} \cdot W^{i-1} \cdot D^j \qquad (3.1-2)$$

The MEA measure the emission reduction in response to the next $\text{MWh}$ of wind output and are a function of the level of ERCOT wind production and demand. As you can see, the equation also contains the regression coefficients of the interaction terms $\beta_{i,j}$ from (3.1-1).
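To see where the factors in equation (3.1-2) come from, consider a single interaction term of equation (3.1-1), for example the one with $i=2$ and $j=1$:

$$-\frac{\partial}{\partial W}\left(\beta_{2,1} W^2 D\right) = -2 \cdot \beta_{2,1} \cdot W \cdot D$$

Summing such terms over all twelve combinations of $i$ and $j$ yields exactly the double sum in (3.1-2).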

If you are only interested in seeing and discussing the results from the computations, you may skip the rest of this exercise and continue with exercise 3.2.

b) Computing the Marginal Emissions Avoided in R

Let's go through the whole process of computing $MEA(W,D)$ for only one of the pollutants, CO$_2$, at first. Start by performing the linear regression according to the model described in equation (3.1-1). The dataset data.rds contains the deseasonalised variables for the square and cube of wind production, wind2_tr and wind3_tr as well as the interaction variables. The variable for $W^2D^3$, for example, is called wind2_load3_tr. Since you already performed similar regressions several times before and entering all nineteen variables is a tedious task, the code is provided below. Run the next two code chunks.

#< task
# Load 'data.rds'.

data = readRDS("data.rds")
#>
#< task
# Perform the regression specified in equation (3.1-1) for CO2.

mgem_co2 = lm(txok_co2_tr ~
                wind_tr + wind2_tr + wind3_tr +
                wind_load_tr + wind2_load_tr + wind3_load_tr + 
                wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                load_tr + load2_tr + load3_tr +
                spp_load_tr + spp_load2_tr + spp_load3_tr, data)
#>

We have to obtain the estimates for $\beta_{i,j}$ from the regression results in order to compute equation (3.1-2). The coefficients from a fitted object (like the one that is returned by lm()) can be accessed using the command coef(). Show the coefficients from mgem_co2.

#< task
# Retrieve the coefficients from 'mgem_co2'.
#>

coef(mgem_co2)

As you can see, the estimates for the twelve coefficients $\beta_{i,j}$ are stored in positions 2 to 13, in accordance with the variables' order in the call to lm(). If only certain elements of an object are required, their position numbers can be specified in square brackets after the object. Several successive elements starting at a and ending at b are accessed using a:b. So if we wanted to access elements 3 to 5 of an object, the command would look like this: object[3:5]. Store the required coefficient estimates in beta_co2_vec and show them.

#< task
# Use one line of code to access estimates 2 to 13 from 'mgem_co2'
# and save them as 'beta_co2_vec'.

# Then show 'beta_co2_vec'.
#>

beta_co2_vec = coef(mgem_co2)[2:13]
beta_co2_vec

beta_co2_vec is a vector consisting of the twelve estimates which we need for computing the marginal emissions avoided for CO$_2$. For the computations below, it is convenient to store the coefficients in a matrix. If you are not familiar with creating matrices in R, read through the info box below. The estimates should be ordered according to the powers of wind and load, so we need three rows for wind and four columns for load and have to fill the matrix by column. Create a matrix from beta_co2_vec and save it as beta_co2.

#< task
# Create a matrix `beta_co2` with the required regression coefficients.
#>

beta_co2 = matrix(beta_co2_vec, nrow=3, ncol=4)

Run the next chunk to show beta_co2 with the function print() and confirm that the values are ordered correctly.

#< task
# 'beta_co2' is displayed to confirm the correct order of the values.

print(beta_co2)
#>

< info "Creating matrices with 'matrix()'"

We can create a matrix using the base-function matrix(). The basic structure of a call to matrix() looks as follows:
matrix(data=..., nrow=..., ncol=..., ...)
data can either be a vector of values that are to be made into a matrix or an object containing the values. The arguments nrow and ncol specify the number of rows and columns in the matrix. Additional arguments which you may specify are byrow and dimnames. The former is set to FALSE by default, making the function fill the matrix by columns. Setting it to TRUE fills the matrix by row. The latter sets the row and column names in the matrix.
The following example creates a $3\times 2$ matrix filled with the numbers one through six.

matrix(1:6, nrow=3, ncol=2)

You can see how, since we did not specify byrow, the matrix is by default filled by column.

>

Consult once more equation (3.1-2). It consists of twelve summands, in each of which we have to plug in the respective $\beta_{i,j}$ as well as $W$ raised to the $(i-1)$'th power and $D$ raised to the $j$'th power. We could do that by just spelling out all summands in full in a command, but that is a very tedious task and not that educational either. Instead, we will use a for-loop. Some of you might already be familiar with them from other programming languages. If you have not yet seen for-loops in R (or at all), read through the info box below.

< info "Repeating Commands With For-Loops"

For-loops are a convenient way to execute recurring tasks. A for-loop starts with the specification of an iteration index and which values it takes. The body contains code which is to be executed and is dependent on the iteration index. The for-loop will then run the code repeatedly, starting at the first value of the iteration index and changing it in each iteration to the next value until it reaches the last one.

In R, for-loops are initialised by for followed by the iteration index and the specification of its values in brackets. Then follow the computations that are to be performed in each iteration, in curly brackets. As an example, the next code chunk uses a for-loop to sum up all integers from one to five:

x = 0

for (i in 1:5) {
  x = x + i
}

x

Before the loop, a variable x is defined, which takes the starting value zero. The iteration index i is first set to one and added to x, which then takes the value one. Then i takes the value two, is added to x and so on until it reaches five.

>

Since $MEA(W,D)$ contains a double sum, we need two for-loops, one of which is nested in the other. Within those we compute a variable mea, which has to be set to zero at the beginning. Each iteration then computes one of the summands in equation (3.1-2) for $MEA(W,D)$ and subtracts it from the value of mea from the previous iteration.
Let us compute the marginal CO$_2$ emissions avoided for a wind energy level $W=820\,\text{MWh}$ and a demand level $D=30,000\,\text{MWh}$. First, define W and D.

#< task
# Define 'W' and 'D'.
#>

W = 820
D = 30000

The code for the two nested for-loops is provided below. The loop for W uses the index i while j indexes the loop for D. We can access the element in row i and column j+1 of beta_co2 using the notation beta_co2[i,j+1]. Before the loops, the starting value of mea is set. Run the code chunk.

#< task
# These nested for-loops compute 'mea' using the values for 'W' and 'D' you
# defined above. The result is shown afterwards.

mea = 0

for (i in 1:3) {
  for (j in 0:3) {
    mea = mea - i*beta_co2[i,j+1]*W^(i-1)*D^j
  }
}

mea
#>

< quiz "MEA"

question: What dimensions does the result have?

sc:
- tons
- MWh
- tons/MWh*
- It is dimensionless.
success: Great, your answer is correct.
failure: Your answer is not correct. Try again.

>


The computations yield $MEA(820,30000)=0.65\,\text{tons/MWh}$. That is, when there are $820\,\text{MWh}$ of wind output and $30,000\,\text{MWh}$ of demand in Texas, the next $\text{MWh}$ of wind energy is estimated to reduce the emission of CO$_2$ in Texas and Oklahoma by $0.65\,\text{tons}$.

If we wanted to compute $MEA(W,D)$ for any other combination of wind generation and demand, we would have to set new values for W and D and run the loops again. That is not a very efficient method (and econometricians do like efficiency). An easier way is to write a function ourselves, which takes values for the levels of wind generation and demand as well as a $\beta_{i,j}$-matrix as arguments, performs the computations which we did manually above, and finally returns the MEA. That way, we only have to call the function with the desired values each time we need to compute $MEA(W,D)$. If you have not yet seen how to write your own function in R, read through the following info box.

< info "Writing Functions in R"

Writing a new function in R is easily done using function(). In the brackets you define which variables are to be passed to the function when calling it; setting default values is possible as well. The code that is to be executed by your function follows in curly brackets. It typically ends with the command return(), in which you specify the function's output. The function has to be assigned a name, which you will later use to call it. As an example, the following code chunk defines a simple function called square() which squares the number you pass to it.

square = function(x) {
  y = x^2
  return(y)
}

Once the function has been defined, you can call it at any time with the parameters of your choice.

square(5)

>

The following code chunk defines a function called mea(), which takes beta, W and D as input, then computes $MEA(W,D)$ just as we did above and returns the final result. The starting value for mea is set within the function now. You only have to run the chunk.

#< task
# This code chunk defines a function 'mea()' which computes MEA(W,D)
# just as we did above.

mea = function(beta,W,D) {
  mea = 0

  for (i in 1:3) {
    for (j in 0:3) {

      mea = mea - i*beta[i,j+1]*W^(i-1)*D^j

    }
  }
  return(mea)
}
#>

Great, this makes life a lot easier for us because we will use this function when plotting the marginal emissions avoided for different levels of wind generation. But first, let us test the function and verify that it gives the same result as our manual computation before. Call mea() with beta_co2 for the parameter beta and choose $W=820\,\text{MWh}$ and $D=30,000\,\text{MWh}$ again.

#< task
# Call mea() with the arguments specified above.
#>
mea(beta_co2, 820, 30000)

< award "Master of Marginal Emissions Avoided"

Congratulations, you learned everything necessary to compute and discuss the marginal emissions avoided and applied it successfully.

>

We do indeed get the same result as before: $MEA(820,30000)=0.65\,\text{tons/MWh}$.

If you are interested in seeing an alternative version of this function and learning about vectorising code, you can take a look at the exercise "Excursus", which you can find after exercise 3.2.

Now that you know how to compute the marginal emissions avoided in R, we will turn to carefully discussing its changes with demand and wind generation for each of the pollutants in the following exercise.

Exercise 3.2 -- Marginal Emissions Avoided: Heterogeneity and its Driving Factors

a) Marginal CO$_2$ Emissions Avoided

To get a more comprehensive understanding of the behaviour of the marginal CO$_2$ emissions that are avoided at different levels of wind production and demand, let's plot $MEA(W,D)$. First, run the following code chunk to load data.rds again.

#< task
# Load the dataset 'data.rds'.

data = readRDS("data.rds")
#>

The next code chunk plots $MEA(W,D)$ for different levels of wind generation (cf. Novan, figure 1; we use the same wind levels and demand range as Novan does - if you are interested in the reasoning behind his choice, you can take a look at p.306 in the paper). The plot is created using the self-written function plot_mea() in order to avoid many long code chunks in this exercise. The function runs the regression, extracts the required coefficients and then creates the plot using our function mea(). Those interested in how the function works will find the code in the info box below. Run the next code chunk.

#< task_notest
# Plot the marginal CO2 emissions avoided using 'plot_mea()'.

plot_mea("co2", data)
#>

< info "'plot_mea()': A Function for Plotting MEA(W,D)"

This self-written function is used to plot $MEA(W,D)$ for CO$_2$, NO$_x$ or SO$_2$.

plot_mea = function(pol,data) {
  frml = formula(paste("txok_",pol,"_tr ~ 
                          wind_tr + wind2_tr + wind3_tr +
                       wind_load_tr + wind2_load_tr + wind3_load_tr + 
                       wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                       wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                       load_tr + load2_tr + load3_tr +
                       spp_load_tr + spp_load2_tr + spp_load3_tr", sep=""))
  mgem = lm(frml ,data)

  beta = matrix(coef(mgem)[2:13], nrow=3, ncol=4)

  if (pol=="co2") {
    title = "Marginal CO2 Emissions Avoided"
  } else if (pol=="nox") {
    title = "Marginal NOx Emissions Avoided"
  } else {
    title = "Marginal SO2 Emissions Avoided"
  }

  library(ggplot2)
  ggplot(data.frame(x = c(23500, 55000)), aes(x = x)) +
    stat_function(fun = mea, args = list(beta = beta, W = 125), aes(colour=" 125")) +
    stat_function(fun = mea, args = list(beta = beta, W = 820), aes(colour=" 820")) +
    stat_function(fun = mea, args = list(beta = beta, W = 1935), aes(colour="1935")) +
    stat_function(fun = mea, args = list(beta = beta, W = 3285), aes(colour="3285")) +
    xlab("Demand [MWh]") +
    ylab("MEA [tons / MWh]") +
    ggtitle(title) +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_colour_manual("Wind [MWh]", values = c("red", "blue", "green", "orange"))
}

>

The plot displays the estimated CO$_2$ emissions avoided due to a marginal increase in wind production. It reveals that the avoidance of CO$_2$ is in general high during hours of lower demand and decreases with increasing demand up to demand levels of about $35,000 - 45,000\,\text{MWh}$. Beyond this point, the emissions avoided by an additional $\text{MWh}$ of wind energy increase again. This range of demand is also crucial in another respect: below circa $35,000\,\text{MWh}$, more CO$_2$ is avoided when the level of wind generation increases, while above $45,000\,\text{MWh}$ of demand, the highest MEA is estimated for the lowest level of wind generation and the MEA decrease with an increasing level of wind energy.

While it is useful to know how the MEA for CO$_2$ depend on the level of wind generation and demand, it would be even more interesting if we could explain and understand this behaviour. In exercise 1.1, we already came across the merit order. It describes which types of power plants start producing as demand increases. Since wind turbines have basically no variable costs, they will usually generate electricity whenever possible, while conventional power plants produce the energy that is needed to meet the residual demand. Which type of power plant is on the margin, and therefore reduces or stops its production in response to the next $\text{MWh}$ of wind energy, thus depends on both the level of wind energy and the level of demand.
But what does this mean for the emissions from fossil fuel power plants? The table below shows how the types and quantities of emitted pollutants vary on average for different types of fossil fuel power plants in our dataset. It displays the emission intensities for coal-fired power plants, combined cycle (CC) gas generators and gas turbines (GT).

Emission Intensities of Fossil Fuel Generators

|                                    | Coal         | CC          | GT           |
|------------------------------------|--------------|-------------|--------------|
| Number of Units                    | 89           | 147         | 387          |
| Average Heat Rate [MMBtu/MWh]      | 10.36 (1.49) | 7.64 (0.80) | 12.38 (4.35) |
| Average CO$_2$ Emission [tons/MWh] | 1.06 (0.10)  | 0.46 (0.07) | 0.74 (0.25)  |
| Average NO$_x$ Emission [lbs/MWh]  | 2.57 (2.05)  | 0.26 (0.38) | 2.23 (3.53)  |
| Average SO$_2$ Emission [lbs/MWh]  | 6.24 (5.75)  | 0.01 (0.06) | 0.06 (0.41)  |

(Standard deviations in parentheses. Own representation based on table 2 in Novan's paper.)

While CC and GT generators are both powered by gas, the two technologies differ in their emission intensities. CC generators have the lowest emission rates for all three pollutants. They combine a gas turbine with a heat engine by powering the latter with the heat produced in the former (EIA (2017c)). This makes them more efficient than a simple gas turbine, which is reflected in the average heat rate in the table above. The heat rate describes how much fuel is needed to produce one $\text{MWh}$ of electricity and is therefore a measure of a power plant's efficiency, where a lower value corresponds to a more efficient power plant (EIA (2017d & e)).
Coal-fired power plants have the highest emission intensities for all three pollutants. Per $\text{MWh}$, they emit almost as much CO$_2$ as CC and GT generators in our dataset combined. The NO$_x$ emissions are largely accounted for by coal and GT generators. And SO$_2$ is almost solely emitted by coal generators.

The table shows that the emissions depend greatly on the types of electricity generators producing at a given time. And this is exactly what is driving the heterogeneity in avoided emissions due to wind energy. So in order to explain the behaviour of the marginal emissions avoided, we should take a look at the marginal generation that is avoided due to wind energy.
To do so, we will use the gross ERCOT and SPP generation from coal, CC and GT generators as dependent variables in the model specified in equation (3.1-1). They are called txok_gload_coal_tr, txok_gload_cc_tr and txok_gload_gt_tr in our dataset. The explanatory variables stay the same as before. If you are not familiar with the term "gross generation", check the info box below. Then we use equation (3.1-2) again, but the results will now represent the marginal generation avoided. Since the procedure itself is nothing new and we are more interested in the results, the following three code chunks already contain all necessary commands and you merely have to run them. They will create a set of plots visualising the marginal generation avoided for coal, CC and GT generators as well as show once more the MEA for CO$_2$ (cf. Novan, figure 2; we use the same wind levels and demand range as Novan does again).

< info "Net vs. Gross Generation"

Gross generation is the total electricity generated at a power plant. Subtracting from this the electricity that is consumed at the power plant yields the net generation.
(EIA (2017a & b))

>

#< task
# The first block of code runs the regression for the generation from
# coal-fired power plants and extracts the estimates needed for computing
# the marginal generation avoided.

# The following two blocks of code do the same for CC and GT.

mggen_coal = lm(txok_gload_coal_tr ~
                wind_tr + wind2_tr + wind3_tr +
                wind_load_tr + wind2_load_tr + wind3_load_tr +
                wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                load_tr + load2_tr + load3_tr +
                spp_load_tr + spp_load2_tr + spp_load3_tr, data)
beta_coal = matrix(coef(mggen_coal)[2:13], nrow=3, ncol=4)


mggen_cc = lm(txok_gload_cc_tr ~
                wind_tr + wind2_tr + wind3_tr +
                wind_load_tr + wind2_load_tr + wind3_load_tr +
                wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                load_tr + load2_tr + load3_tr +
                spp_load_tr + spp_load2_tr + spp_load3_tr, data)
beta_cc = matrix(coef(mggen_cc)[2:13], nrow=3, ncol=4)


mggen_gt = lm(txok_gload_gt_tr ~
                wind_tr + wind2_tr + wind3_tr +
                wind_load_tr + wind2_load_tr + wind3_load_tr +
                wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                load_tr + load2_tr + load3_tr +
                spp_load_tr + spp_load2_tr + spp_load3_tr, data)
beta_gt = matrix(coef(mggen_gt)[2:13], nrow=3, ncol=4)
#>
#< task_notest
# These lines use our function mea() in order to produce and save a plot
# for each type of fossil fuel power plant.
library(ggplot2)
pcoal = ggplot(data.frame(x = c(23500, 55000)), aes(x = x)) +
  stat_function(fun = mea, args = list(beta = beta_coal, W = 125), aes(colour=" 125")) +
  stat_function(fun = mea, args = list(beta = beta_coal, W = 820), aes(colour=" 820")) +
  stat_function(fun = mea, args = list(beta = beta_coal, W = 1935), aes(colour="1935")) +
  stat_function(fun = mea, args = list(beta = beta_coal, W = 3285), aes(colour="3285")) +
  xlab("Demand [MWh]") +
  ylab("Marginal Coal Generation Avoided [MWh / MWh]") +
  scale_y_continuous(limits=c(-0.2,1)) +
  scale_colour_manual("Wind [MWh]", values = c("red", "blue", "green", "orange"))

pcc = ggplot(data.frame(x = c(23500, 55000)), aes(x = x)) +
  stat_function(fun = mea, args = list(beta = beta_cc, W = 125), aes(colour=" 125")) +
  stat_function(fun = mea, args = list(beta = beta_cc, W = 820), aes(colour=" 820")) +
  stat_function(fun = mea, args = list(beta = beta_cc, W = 1935), aes(colour="1935")) +
  stat_function(fun = mea, args = list(beta = beta_cc, W = 3285), aes(colour="3285")) +
  xlab("Demand [MWh]") +
  ylab("Marginal CC Generation Avoided [MWh / MWh]") +
  scale_y_continuous(limits=c(-0.2,1)) +
  scale_colour_manual("Wind [MWh]", values = c("red", "blue", "green", "orange"))

pgt = ggplot(data.frame(x = c(23500, 55000)), aes(x = x)) +
  stat_function(fun = mea, args = list(beta = beta_gt, W = 125), aes(colour=" 125")) +
  stat_function(fun = mea, args = list(beta = beta_gt, W = 820), aes(colour=" 820")) +
  stat_function(fun = mea, args = list(beta = beta_gt, W = 1935), aes(colour="1935")) +
  stat_function(fun = mea, args = list(beta = beta_gt, W = 3285), aes(colour="3285")) +
  xlab("Demand [MWh]") +
  ylab("Marginal GT Generation Avoided [MWh / MWh]") +
  scale_y_continuous(limits=c(-0.2,1)) +
  scale_colour_manual("Wind [MWh]", values = c("red", "blue", "green", "orange"))

# These lines load the package 'lemon' and arrange the three plots next to each
# other for easier comparison using 'grid_arrange_shared_legend()'.
library(lemon)
grid_arrange_shared_legend(pcoal, pcc, pgt, nrow = 1, position = 'bottom')
#>
#< task_notest
# Show once more the plot of the MEA for CO2.
plot_mea("co2", data)
#>

The three upper plots show how the marginal output from the three types of fossil fuel technologies is replaced by wind energy, depending on the levels of demand and wind. In hours with low demand, only very little generation from gas turbines is offset. This coincides with what the merit order for Texas in exercise 1.1 shows, namely that these types of power plants are usually deployed in hours of peak demand. The left plot above suggests that coal units are base load power plants, because their generation is substantially replaced when demand is low, especially at high levels of wind generation. This seems to contradict the merit order in exercise 1.1, in which coal-fired power plants only produce at higher levels of demand. But we have to keep in mind that the depicted merit order is only a simplified representation; the real power plant dispatch depends on many factors, for example gas prices, as Rhodes et al. (2017) illustrate. We will therefore rely on the results from our data. The figures above show that CC generation is replaced at lower levels of demand as well, especially at low levels of wind generation. For very high levels of demand and low levels of wind generation, the results suggest that CC generation might even increase in response to wind energy.
Coal is a CO$_2$-intensive technology, which can explain the initially high MEA for CO$_2$ at low-demand hours. Since we are in the base-load range at low levels of demand, more wind energy replaces more coal. This is why, at low demand levels, the amount of offset CO$_2$ increases with the level of wind generation.
The replacement of coal generation decreases with increasing demand. That of CC initially increases before decreasing beyond about $35,000\,\text{MWh}$ of demand. This suggests that CC power plants in our dataset are used when medium electricity levels are needed. With increasing demand, more and more gas turbines are replaced, which supports the assumption that they are mostly generating at high levels of demand. CC and GT are cleaner technologies than coal when it comes to CO$_2$, which explains why, when demand increases, we initially find decreasing marginal CO$_2$ emissions avoided. The replacement of GT dominates at high levels of demand, which can explain why the amount of CO$_2$ avoided increases again for higher levels of demand. We find increasing offsets in GT generation for decreasing levels of wind generation. The reason for this is that higher wind levels still offset generation from CC power plants. This behaviour explains why we find higher MEA at lower levels of wind generation.

b) Marginal NO$_x$ Emissions Avoided

Now let's take a look at the marginal emission avoided for NO$_x$.


< quiz "Factors Driving MEA for NOx"

question: What are the key factors in the marginal NOx emissions avoided due to wind energy? Keep in mind the plots for the marginal generation avoided and the table with the emission intensities. Multiple answers are correct.
mc:
- At low-demand hours, mostly coal units curtail their production, which reduces the emission of NOx considerably.*
- At low-demand hours, mostly CC units curtail their production, which reduces the emission of NOx considerably.
- At low-demand hours, mostly GT units curtail their production, which reduces the emission of NOx considerably.
- At high-demand hours, mostly coal units curtail their production, which reduces the emission of NOx considerably.
- At high-demand hours, mostly CC units curtail their production, which reduces the emission of NOx considerably.
- At high-demand hours, mostly GT units curtail their production, which reduces the emission of NOx considerably.*
success: Great, your answer is correct. CC gas generators emit comparably little NOx, so while they do reduce their production especially during low- and medium-demand hours, they do not account for a lot of the NOx reduction. At lower-demand hours, wind generation offsets only little generation from gas turbines, but large quantities from coal units. GT units operate mostly during high-demand hours and emit almost as much NOx as coal units. So they account for most of the emission when the electricity demand is high.
failure: Your answer is not correct. Look carefully at the plots and the table.

>


< quiz "NOx MEA"

question: Based on the considerations in the last quiz, how do you expect the plot for NOx to look? Multiple answers are correct.
mc:
- The MEA increase with wind generation at low levels of demand.*
- The MEA decrease with wind generation at low levels of demand.
- The MEA increase with demand at higher levels of demand.*
- The MEA decrease with demand at higher levels of demand.
- The MEA increase with wind generation at higher levels of demand.
- The MEA decrease with wind generation at higher levels of demand.*
success: Great, your answer is correct.
failure: Your answer is not correct. Check again the correct answer of the quiz above and consider the plots for coal, CC and GT.

>


< award "Predictor"

Well done, you used your knowledge about the emissions of different types of power plants and their response to wind power to deduce what the plot for the marginal NOx emissions avoided should look like.

>

Run the next two code chunks to compute and plot the marginal emissions avoided for NO$_x$ as well as show again the three plots for the marginal generation avoided.

#< task_notest
# Show the three plots for the marginal generation avoided.
grid_arrange_shared_legend(pcoal, pcc, pgt, nrow = 1, position = 'bottom')
#>
#< task_notest
# Plot the marginal NOx emissions avoided using 'plot_mea()'.
plot_mea("nox", data)
#>

As with CO$_2$, the estimates of $MEA(W,D)$ for NO$_x$ show that the avoided emissions increase with the level of wind generation for demand below about $35,000\,\text{MWh}$. This behaviour is driven by the replacement of coal-fired electricity. With increasing demand, wind generation initially replaces less coal and more of the cleaner CC generation, which is why we see the offset of NO$_x$ go down. For demand beyond $45,000\,\text{MWh}$, the MEA increase again and reach higher levels for lower levels of wind generation. This can be explained by the growing substitution of gas turbines, which have similar NO$_x$ emission intensities to coal-fired generation.

c) Marginal SO$_2$ Emissions Avoided

Finally, we will analyse the MEA for SO$_2$.


< quiz "Factors Driving MEA for SO2"

question: What is the key factor driving the reduction in marginal SO2 emissions due to wind energy? Keep in mind the plots for the marginal generation avoided and the table with the emission intensities.

sc:
- The reduction in coal generation accounts for most of the marginal SO2 emissions avoided.*
- The reduction in CC generation accounts for most of the marginal SO2 emissions avoided.
- The reduction in GT generation accounts for most of the marginal SO2 emissions avoided.
success: Great, your answer is correct. Gas-fired units only emit little SO2 while coal emits large quantities. So the marginal SO2 emissions follow the reduction in coal generation.
failure: Your answer is not correct. Look carefully at the plots and the table.

>


Run the next code chunks to create the plots.

#< task_notest
# Show the three plots for the marginal generation avoided.
grid_arrange_shared_legend(pcoal, pcc, pgt, nrow = 1, position = 'bottom')
#>
#< task_notest
# Plot the marginal SO2 emissions avoided using 'plot_mea()'.
plot_mea("so2", data)
#>

The panel on the bottom shows the estimated quantities of SO$_2$ avoided due to wind energy in the ERCOT market. It reveals that the SO$_2$ pollution offset is highest at hours with low demand. Here, the avoided quantities increase with increasing level of wind production while at higher levels of demand, beyond about $40,000\,\text{MWh}$, this relation reverses. The MEA plot closely resembles that of avoided coal generation because the emission of SO$_2$ is largely caused by coal-fired generators.

Overall we find that, in low-demand hours, the highest quantities of pollutants are offset at high levels of energy generation from wind turbines. In high-demand hours, more pollutants are avoided at lower levels of wind generation. This is one of the reasons why Novan speaks of heterogeneity in the context of pollution from conventional power plants that is offset in response to energy from wind turbines.

In exercise 4, we will see how to use $MEA(W,D)$ to estimate the emissions that are offset by new wind turbines.

Exercise Excursus -- Loops vs. Vectorisation

In this excursus we will address the question whether it is better to use loops or vectorised expressions in R.
It is widely considered best practice in R to vectorise code instead of using for-loops where possible (see for example Burns (2011)). This means using vectors or matrices and applying functions or operations to them in one go. In contrast, a for-loop applies all operations to one element at a time. Vectorised code is usually shorter, and therefore less error-prone, and often faster as well (Burns (2011)).
One reason why using vector operations and R functions is usually faster is that many R functions make use of code written in other programming languages, like C or Fortran. The looping is then done in these languages, which work a little differently than R does and can execute loops faster. A good and intelligible description of why R is sometimes slower than other programming languages is given in Ross (2014). So applying R functions to vectors, matrices or even higher-order arrays can save a lot of computation time.
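For a first taste of the contrast (a tiny sketch with made-up numbers, independent of our MEA computations), the following lines sum the first $100,000$ squares once with a for-loop and once with vectorised code:

# Loop version: processes one element at a time.
s = 0
for (v in 1:100000) {
  s = s + v^2
}
s

# Vectorised version: squares and sums the whole vector in one go.
sum((1:100000)^2)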

As an example, the next code chunk shows a vectorised version of our function mea() called mea_vec(). For comparison, mea() is provided as well. Run the next code chunk to define both functions.

#< task
# Our function 'mea()', which we derived in exercise 3.1.
mea = function(beta,W,D) {
  mea = 0

  for (i in 1:3) {
    for (j in 0:3) {

      mea = mea - i*beta[i,j+1]*W^(i-1)*D^j

    }
  }
  return(mea)
}

# A vectorised version of 'mea()'.
mea_vec = function(beta,W,D) {

  mea = -sum(t(1:3*beta*W^(0:2))*D^(0:3))

  return(mea)
}
#>

As you can see, the vectorised function requires only one line of code within the function call. In order to understand the function you have to know a few peculiarities in R, which might at first not seem intuitive (especially when coming from other programming languages), but which can be used to vectorise code.
Let's take a look at the expression 1:3*beta. 1:3 simply defines a vector $(1,2,3)$. If beta was a scalar, this would just be scalar multiplication: beta would be applied to each element of the vector and the result would be $(1\times \text{beta}, 2\times \text{beta}, 3\times \text{beta})$. But beta is a $3 \times 4$ matrix in our case. This, however, does not make much of a difference to R. In the same way as it reuses the scalar in the case of scalar multiplication, it reuses the elements of the vector and multiplies them with each column (it is important to know that R does this by column). The result is again a matrix with the same number of rows and columns as the original matrix. This is referred to as "recycling" (Venables et al. (2017, chapters 2.2 & 5.4.1)). Below you find a simple example illustrating what happens when multiplying a vector and a matrix in R. Run the code chunk.

#< task
# Create a vector 'a' and a matrix 'b'.
a = 1:3
b = matrix(1:12, nrow = 3)

# Show 'a' and 'b'.
a
b

# Compute and show 'a*b'.
a*b
#>

Frazee (2014) explains a few more of these peculiarities in R and how you can use them when vectorising your code.

The rest of the expression in mea_vec() uses basically the same mechanism. t() transposes the matrix and sum() sums over all its elements.
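If you want to trace the recycling steps yourself, the following sketch decomposes the expression using a small example matrix (the values are arbitrary and serve only as an illustration):

# Decompose the one-liner in 'mea_vec()' step by step.
beta_ex = matrix(1:12, nrow = 3, ncol = 4)  # arbitrary example coefficients
W_ex = 2
D_ex = 10

step1 = 1:3 * beta_ex         # row i is multiplied by i (recycling by column)
step2 = step1 * W_ex^(0:2)    # row i is additionally multiplied by W^(i-1)
step3 = t(step2) * D_ex^(0:3) # after transposing, row j+1 is multiplied by D^j
-sum(step3)                   # the double sum, with reversed sign

# The same result in one line:
mea_vec(beta_ex, W_ex, D_ex)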

Run the next code chunk to verify that the two functions for computing $MEA(W,D)$ actually give the same result.

#< task
# First load the dataset 'data.rds'.
data = readRDS("data.rds")

# Then run the regression 'mgem_co2' again, as specified in (3.1-1).
mgem_co2 = lm(txok_co2_tr ~
                wind_tr + wind2_tr + wind3_tr +
                wind_load_tr + wind2_load_tr + wind3_load_tr + 
                wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                load_tr + load2_tr + load3_tr +
                spp_load_tr + spp_load2_tr + spp_load3_tr, data)

# Extract the required coefficients.
beta_co2 = matrix(coef(mgem_co2)[2:13], nrow=3, ncol=4)

# Run both functions with the same values.
mea(beta_co2, 820, 30000)
mea_vec(beta_co2, 820, 30000)
#>

Now let's compare the computation time for both versions of the function. To do so, we can use the function microbenchmark() from the identically named package. It runs both functions a specified number of times and each time measures how long the computation takes. Click check below to run the code.

#< task
# Load the package 'microbenchmark'.
library(microbenchmark)

# Use the function 'microbenchmark()' to run both 'mea()' and 'mea_vec()'
# 10000 times, each time measuring the computation time. An overview of the
# results is shown with 'summary()'.
summary(microbenchmark(mea(beta_co2, 820, 30000), mea_vec(beta_co2, 820, 30000),
               times=10000, unit='us'))
#>

As you can see, on average the computation of mea() takes longer than that of mea_vec(). Of course, both functions are very quick (the times are displayed in microseconds, as specified by the argument unit='us') and we would not notice the time difference when using the functions once. But for more elaborate functions, and when a lot more iterations are required, the difference in run time can be large, making the vectorised function the better choice.

While it is important to know about the advantages of vectorised R code, especially for those of you who want to gain a deeper understanding of R, we will continue to use for-loops in the rest of the problem set. Our functions are comparatively small and simple, and the difference between a vectorised version and one using for-loops is of little consequence. The advantage of for-loops, in my opinion, is that they are more intuitive and usually easier to read, especially for beginners in R.

Exercise 4 -- Average Emissions Avoided by Wind Capacity Additions

In exercise 3.2, we computed how marginal increases in power production from wind turbines affect the emission of CO$_2$, NO$_x$ and SO$_2$ from fossil fuel power plants in the short run. Ultimately, our goal is to judge the effectiveness of different policies that seek to promote investments in renewable energies. To that end, we have to examine the impact of installing additional wind capacity.

a) A Model for the Average Emissions Avoided

A power unit's capacity is its maximum power output and can be measured in megawatts ($\text{MW}$). Most of the time the unit will not produce at its maximum, but at varying levels below the maximum capacity. A power plant's utilisation is described by the capacity factor $x$, which is defined as its production over a certain period of time divided by the capacity $K$. It is dimensionless and takes values between zero and one. We can express the capacity factor for a wind turbine as follows:

$$x=\frac{W}{K} \qquad (4-1)$$
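As a simple numerical example (with made-up values): a wind farm with a capacity of $K=100\,\text{MW}$ that produces $W=35\,\text{MWh}$ in a given hour has a capacity factor of $x=35/100=0.35$ for that hour.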

We need the hourly capacity factor (or electricity generation and capacity) for each of the wind turbines in the ERCOT market. Unfortunately, the original dataset only contains the aggregate production over all wind turbines. Novan solves this problem by assuming that all ERCOT wind turbines have the same capacity factor. He argues that the wind turbines in Texas are all installed in the same area and therefore experience similar wind conditions. Implicitly, this also means that he assumes the turbines to be equally affected by curtailments. Novan calculated the hourly capacity factor by dividing the aggregate hourly wind generation by the respective total capacity (cf. page 310 in the paper). He retrieved the latter from the Public Utility Commission of Texas. Since only the month of the capacity additions is reported there, Novan used articles from local newspapers to find the exact date on which each of the new turbines was taken into operation. Novan's values for the capacity factor are stored as cap_factor in our dataset.

As already mentioned above, we are interested in the amount of offset emissions due to increases in wind capacity. More precisely, we want to compute the average effect of capacity additions over all levels of demand and utilisation. To that end, we first rewrite equation (3.1-2) for the marginal emissions avoided, expressing $W$ by the capacity factor $x$ and the total wind capacity $K$ according to equation (4-1):

$$MEA(x\cdot K,D)= \frac{-\partial E(x\cdot K,D)}{\partial W}=-\sum_{i=1}^3 \sum_{j=0}^3 i\cdot \beta_{i,j} \cdot (x\cdot K)^{i-1} \cdot D^j$$

Novan then devises a formula for calculating the emissions that are on average avoided in one hour by the $K$'th $\text{MW}$ of installed wind capacity:

$$\text{Average Avoided Emissions}(K)=\int_{x=0}^{x=1} \int_{\underline D} ^{\overline D} x \cdot MEA(x\cdot K,D) \cdot f(x,D)dDdx$$

Let's break the equation down so we understand what it means. $MEA(x\cdot K,D)$ describes how much pollution is offset by the next $\text{MWh}$ of wind energy when the capacity factor is $x$, demand is $D$ and the level of wind capacity is $K$. An additional $\text{MW}$ of wind capacity will at that time offset another $x \cdot MEA(x \cdot K,D)$ of pollutants. This value is weighted with $f(x,D)$, which is the joint distribution of hourly ERCOT demand and the wind capacity factor. In other words, it describes the probability of having an hour, in which the capacity factor is $x$ and demand is $D$. The whole expression is integrated over all possible values of $x$ and $D$, where $\underline D$ and $\overline D$ denote the minimum and maximum values of ERCOT demand that we observe in the evaluation period.
$f(x,D)$ is invariant to the amount of installed wind capacity because of the assumption that all wind turbines have identical capacity factors - in any given hour, each additional wind turbine has the same capacity factor as the ones that were already installed before (cf. Novan, page 309). But what does change is the probability of finding any given combination of wind production $W$ and demand $D$ because, as the capacity grows, the likelihood of finding larger total output from wind turbines increases. We learned in exercise 3.2 that the amount of pollutants avoided by the marginal $\text{MWh}$ of wind output depends on the overall level of wind generation. This implies that the effect of an additional $\text{MW}$ of installed wind capacity will vary depending on the capacity of already installed wind turbines.
The equation for the average avoided emissions above describes the effect of the $K$'th $\text{MW}$ of installed wind capacity on average hourly emissions. If we divide this by the mean hourly capacity factor, $\bar x$, we obtain the average impact of one $\text{MWh}$ from the $K$'th $\text{MW}$ of installed wind capacity on emissions. This is the value that we ultimately want to compute. Novan calls it the average emissions avoided (AEA):

$$AEA(K)=\frac{\int_{x=0}^{x=1} \int_{\underline D} ^{\overline D} x \cdot MEA(x\cdot K,D) \cdot f(x,D)\ dDdx}{\int_{x=0}^{x=1} \int_{\underline D} ^{\overline D} x \cdot f(x,D)\ dDdx} \qquad (4-2)$$

The rest of this exercise focuses mainly on how to compute $AEA(K)$ in R. If you are only interested in seeing and discussing the results, you may skip parts b) and c) and continue with exercise 5.

b) Estimating the Joint Distribution Function

What we still need before we can compute $AEA(K)$ is an estimate for the joint distribution function $f(x,D)$. We will compute this from the observed data in our dataset. We want to estimate the joint distribution function for two variables: the capacity factor and ERCOT demand. Imagine the two variables being on the axes of the base plane in three-dimensional space. Following Novan's approach, we divide each axis into $40$ bins of equal size, spanning the whole range of observed values of both variables. This leaves us with a total of $1,600$ bins, into which all $43,795$ capacity factor-demand pairs in our dataset are sorted. Novan approximates this frequency distribution in Stata with the function kdens2, which uses a Gaussian kernel in order to compute and plot the kernel density estimate for two variables (Baum (2017)). If you are not familiar with kernel density estimation, you can take a look at the info box below. We will use the function KernSur() from the package GenKern to replicate the computations in R. The info box also gives you more information on this package. Run the next code chunk to load the dataset data.rds.

#< task
# Load the dataset 'data.rds'.

data = readRDS("data.rds")
#>

< info "Kernel Density Estimation"

The realisation of a set of observations is based on a probability distribution describing the data generating process. The exact distribution (or density) is often not known, but has to be estimated. One method of obtaining a density estimate is by kernel density estimation.
A kernel is a positive, symmetric function which integrates to one. One kernel $K$ is centred around each data point $X_i$; summing the kernels then measures how many observations lie close to a given point $x$. For one dimension, the scaled kernel has the general form

$$\frac{1}{h}\cdot K \left(\frac{x-X_i}{h}\right)$$

where the bandwidth $h$ defines which points are considered to be close. The density estimate is the sum of all $n$ individual kernels:

$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n K \left(\frac{x-X_i}{h}\right)$$

For $d$ dimensions, the multivariate kernel density estimator looks as follows:

$$\hat f(x) = \frac{1}{n|H|}\sum_{i=1}^n K \left(H^{-1}(x-X_i)\right)$$

$H$ is here the $d \times d$ bandwidth matrix. We will estimate a distribution in two dimensions.
There are various functions which are used as kernels. For our estimations, we will use Gaussian (or normal) kernels just as Novan does in his paper. They are of the following form:

$$K(x)=\frac{1}{2\pi\sqrt{\det(\Sigma)}} \exp\left(-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right)$$

where

- $\Sigma$ is the covariance matrix and
- $\mu$ is the two-dimensional mean vector.

We will not place one kernel function around each observation, but rather divide the data range into bins and centre one kernel around each bin midpoint, just as Novan does. The value at each of those points is a weighted function of the observations in the bin. This procedure is called binned kernel estimation and decreases the computational time for estimating the density function.
(Hollander et al. (2014, chapter 12) and Scott (2015, chapter 6.3.2))
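As a small illustration of the univariate estimator above (a sketch with made-up data and an ad hoc bandwidth, not part of Novan's procedure), the following lines implement $\hat f(x)$ with a Gaussian kernel by hand:

# Made-up sample X_1, ..., X_n and an ad hoc bandwidth h.
set.seed(1)
X = rnorm(100)
h = 0.4

# dnorm() is the Gaussian kernel: average the n kernels and divide by h.
f_hat = function(x) sapply(x, function(p) mean(dnorm((p - X) / h)) / h)

f_hat(0)  # density estimate at the point x = 0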

The package GenKern in R contains functions to estimate kernel density distributions, both for one variable (univariate) and for two (bivariate). There are two main functions in the package: KernSec(), which estimates univariate distributions and KernSur(), which estimates bivariate distributions. We are using the latter to estimate $f(x,D)$. Both functions use Gaussian kernels for the computations (Lucy & Aykroyd (2015)).

>

Now load the package GenKern and estimate the joint distribution function using KernSur(). You have to specify the two variables cap_factor and load (in this order) and the number of bins for both variables, using the arguments xgridsize and ygridsize. Save the estimate for $f(x,D)$ as dist_est. It might take a little longer than usual to compute the joint distribution function.

#< task
# Load the required package.

# Then estimate the joint distribution function f(x,D).
#>

library(GenKern)
dist_est = KernSur(data$cap_factor, data$load, 40, 40)

Let's see what kind of object is returned by KernSur. Use the base-function typeof() to find out what type of object dist_est is.

#< task
# What kind of object is 'dist_est'? Find out using `typeof()`.
#>

typeof(dist_est)

dist_est is a list, a very flexible object, which can consist of several objects of different types (Venables et al. (2017, chapter 6.1)). Can we find out a bit more about these objects? Use names() to learn the number of objects in dist_est and their names.
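If lists are new to you, here is a small example of a list holding two objects of different types:

# A list with a numeric vector and a character string.
l = list(numbers = 1:3, grid = "ERCOT")
names(l)   # the names of the objects in the list
l$numbers  # access an element by its name
l[[2]]     # or by its position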

#< task
# How many objects are in 'dist_est' and what are they called?
# Find out using 'names()'.
#>

names(dist_est)

There are three objects in dist_est, called xords, yords and zden. Let's find out more about the first object. Examine xords using head() and length(). As the name suggests, with this last function you obtain or set the length of an object. The notation using the $-sign can be applied for retrieving an object in a list in the same way as accessing a column in a data frame: list$object.

#< task
# Examine 'xords' using 'head()' and 'length()'.
#>

head(dist_est$xords)
length(dist_est$xords)

xords is a vector with $40$ elements. These are the $40$ bin midpoints of the capacity factor from the joint distribution estimation. Accordingly, yords contains the $40$ bin midpoints for demand. Read through the info box below for a few rather technical remarks on bin midpoints and the joint distribution estimation.

< info "A Few More Remarks on Estimating the Joint Distribution Function"

Lucy & Aykroyd (2015) point out that while the points at which the distribution is estimated in KernSur() can be thought of as bin midpoints, technically they are not. For simplicity we will nevertheless stick to the interpretation of them as bin midpoints.
Furthermore, you might have noticed that the first bin midpoint for the capacity factor is smaller than zero even though $x$ can only take values between zero and one. We could set the first point to zero (which is what Novan does, cf. his Stata code), but since it is very close to zero, we will ignore this.
Finally please be aware that our estimated density values as well as the points they are estimated at will not be exactly the same as the ones which Novan gets in Stata. This is to be expected and of little concern for the following computations.

>

What is left now, is to find out more about zden. Use dim() to get the dimensions of the object.

#< task
# Use 'dim()' to get the dimensions of 'zden'.
#>

dim(dist_est$zden)

zden is a $40 \times 40$ matrix, which contains the density estimates for each of the $1,600$ capacity factor-demand combinations. The rows correspond to the capacity factor and the columns to demand.
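As an optional sanity check (an addition of mine, not part of Novan's procedure), the density estimates should approximately integrate to one over the grid. We can approximate the integral by summing the density values weighted with the area of a grid cell:

# Approximate the integral of f(x,D) over the grid by a Riemann sum.
dx = diff(dist_est$xords)[1]  # width of a capacity factor bin
dD = diff(dist_est$yords)[1]  # width of a demand bin
sum(dist_est$zden) * dx * dD  # should be roughly one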

c) Computing AEA(K) in R

Let's come back to equation (4-2) for $AEA(K)$. We will use the bin midpoints xords and yords as the discrete values for $x$ and $D$, respectively, and of course zden for $f(x,D)$. The estimate for $AEA(K)$ then looks as follows:

$$\widehat{AEA}(K)=\frac{\sum_{i=1}^{40} \sum_{j=1}^{40} x_i \cdot \widehat{MEA}(x_i \cdot K,D_j) \cdot \hat{f}(x_i,D_j)}{\sum_{i=1}^{40} \sum_{j=1}^{40} x_i \cdot \hat{f}(x_i,D_j)} \qquad (4-3)$$

where

- $x_i$ and $D_j$ are the bin midpoints for the capacity factor and ERCOT load from the joint distribution estimation,
- $\hat{f}(x_i,D_j)$ are the estimates of the joint distribution and
- $\widehat{MEA}(x_i \cdot K,D_j)$ are the estimates of $MEA(x\cdot K,D)$.

We will now compute $\widehat{AEA}(K)$ step by step. First, save the three objects in dist_est (xords, yords and zden) as x, D and f, respectively.

#< task
# Save the three objects in 'dist_est' individually.
#>

x = dist_est$xords
D = dist_est$yords
f = dist_est$zden

We need our function for the marginal emissions avoided again for the computation of $\widehat{AEA}(K)$, with the minor change that $W$ is replaced by $x\cdot K$. Run the next code chunk to define the function mea() again for this exercise.

#< task
# This function computes MEA(x*K,D), similarly to the one we wrote in
# exercise 3.1. The only difference here is that 'W' is replaced by 'x*K'.

mea = function(beta,x,K,D) {
  mea = 0

  for (i in 1:3) {
    for (j in 0:3) {

      mea = mea - i*beta[i,j+1]*(x*K)^(i-1)*D^j

    }
  }
  return(mea)
}
#>

For mea() we need once more the $\beta_{i,j}$ from the regression in equation (3.1-1). Run the next code chunk to perform the regression again, for now only for CO$_2$, and store the $\beta_{i,j}$.

#< task
# Perform the regression specified in equation (3.1-1) for CO2.
mgem_co2 = lm(txok_co2_tr ~
                wind_tr + wind2_tr + wind3_tr +
                wind_load_tr + wind2_load_tr + wind3_load_tr + 
                wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                load_tr + load2_tr + load3_tr +
                spp_load_tr + spp_load2_tr + spp_load3_tr, data)

# Store the beta_i,j.
beta_co2 = matrix(coef(mgem_co2)[2:13], nrow=3, ncol=4)
#>

Now we have all required elements for computing the average emissions avoided. As with the MEA, we have double sums, both in the numerator and the denominator, which we compute with two nested for-loops. We have to compute the numerator and the denominator separately within the loops and divide them afterwards. Just as any other function, mea() can be called within the loops with the required values.
Run the next code chunk, which already contains all commands. The start values for the numerator aea_num and the denominator aea_den are set first. Then follow the nested for-loops. The iteration index for x is i, the one for D is j. They specify which bin midpoint is plugged into the equation in each iteration. We use $K=5,000\,\text{MW}$ in mea().

#< task_notest
# Run the code for computing the numerator and denominator of AEA(K)
# in two nested for-loops.

aea_num = 0
aea_den = 0

for (i in 1:40) {
  for (j in 1:40) {

    aea_num = aea_num + x[i]*mea(beta_co2, x[i], 5000, D[j])*f[i,j]
    aea_den = aea_den + x[i]*f[i,j]

  }
}
#>

The only thing left to do now is to divide the two results.

#< task
# Calculate the final result.
#>

aea_num/aea_den

< quiz "Dimension AEA"

question: Which units does the result have?

sc:
- It is dimensionless.
- tons/MW
- tons/MWh*
success: Great, your answer is correct.
failure: Your answer is not correct. Take another look at the equation (4-2) for AEA(K) or its description. Then try again.

>


< award "Master of Average Emissions Avoided"

Well done, you learned how to estimate and interpret the average emissions avoided!

>

When there are $5,000\,\text{MW}$ of wind capacity installed in Texas, the $5,000$'th unit is estimated to avoid on average $0.67\,\text{tons}$ of CO$_2$ emissions per $\text{MWh}$ of generated wind energy.

Of course, if we want to judge different wind and renewable energy policies, we have to study the impact of capacity additions for different levels of total installed wind capacity. This can be achieved by writing another function for $\widehat{AEA}(K)$, which we can then apply for all pollutants, different levels of capacity and even, as we will see in exercise 6.2, other renewable technologies. Since you already saw how to write a function in exercise 3.1, and the new function's content will essentially be familiar to you, the code is provided in the next chunk.

The new function is called aea() and takes the following variables as input:

- k: a vector with the capacity levels $K$ (in $\text{MW}$) for which $\widehat{AEA}(K)$ is computed,
- xdata: the data for the capacity factor,
- Ddata: the data for the ERCOT load,
- beta: the matrix with the coefficients $\beta_{i,j}$ and
- xbin and Dbin: the numbers of bins for the joint distribution estimation.

The last two arguments both have the default value $40$. Having them as arguments makes the function more general because we can change the number of bins for the joint distribution estimation. A description of the different parts in aea() and how they work is in the code chunk. Running it is optional because we will use the function in the next exercise for the first time.

#< task
# A function which computes AEA(K) according to formula (4-3).
aea = function(k, xdata, Ddata, beta, xbin=40, Dbin=40) {

  # Load 'GenKern' and estimate the joint distribution function 'dist_est'
  # for capacity factor and demand.
  library(GenKern)
  dist_est = KernSur(xdata, Ddata, xbin, Dbin)

  # Retrieve the vectors of bin midpoints 'x' and 'D' and the matrix with
  # the corresponding density values 'f'.
  x = dist_est$xords
  D = dist_est$yords
  f = dist_est$zden

  # Define a data frame 'df', which has two columns and as many rows as there
  # are elements in the vector with capacity levels 'k' that was given to the
  # function. The two columns get the names 'K' and 'AEA'. 'K' is filled with
  # the values in 'k'. 'AEA' contains at first only NA's, but will successively
  # be filled with the results for AEA(K) for each level of K.
  df = data.frame(K=k, AEA=rep(NA, times=length(k)))

  # Set the start values for 'aea_num' and 'aea_den'.
  aea_num = 0
  aea_den = 0

  # These for-loops compute AEA(K). They are slightly changed compared to
  # before: the two iteration indices 'i' and 'j' now stop at the values
  # specified by 'xbin' and 'Dbin'.
  for (i in 1:xbin) {
    for (j in 1:Dbin) {

      # 'mea' is no longer computed in a separate function, but within the
      # loops. Since 'i' and 'j' are already taken by the outer loops, the
      # iteration indices are now 'm' and 'n'.
      mea = 0

      for (m in 1:3) {
        for (n in 0:3) {
          mea = mea - m*beta[m,n+1]*(x[i]*k)^(m-1)*D[j]^n

        }
      }

      # Compute 'aea_num' and 'aea_den'.
      aea_num = aea_num + x[i]*mea*f[i,j]
      aea_den = aea_den + x[i]*f[i,j]

      # Divide 'aea_num' by 'aea_den' and fill column 'AEA' of 'df' with the
      # result. Note that 'k' is the whole vector of capacity levels, so all
      # computations above are vectorised over the capacity levels and each
      # row of 'df' receives the value for its own 'K'.
      df[which(df[,1]==k),2] = aea_num/aea_den

    }
  }

  # Return the now filled data frame 'df'.
  return(df)
}
#>
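
If you would like to try the function right away (optional, and assuming data and beta_co2 from above are still in memory), the following call computes $\widehat{AEA}(K)$ for CO$_2$ at $K=0$ and $K=5,000\,\text{MW}$; the second value should reproduce the result we computed manually.

aea(c(0, 5000), data$cap_factor, data$load, beta_co2)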

In exercise 5, we will use this function to compute the average emissions avoided due to wind energy for all three pollutants and different levels of capacity. Based on those results, we will express the avoided pollution in monetary terms to have a measure for the social benefits from wind energy. This will allow us to eventually judge renewable energy policies.

Exercise 5 -- Social Benefits from Wind Energy

This exercise ties in right where we left off in exercise 4. There we learned how to use the estimates for the emissions offset by the marginal $\text{MWh}$ of wind production for computing the average emissions that are avoided by generation from the marginal unit of wind capacity.
We will now use the self-written function aea() to compute the average emissions avoided in response to wind generation in Texas. Based on these results we will calculate the associated social benefits.

a) Average Emissions Avoided

Run the next code chunk to load data.rds and compute the $\beta_{i,j}$ as defined in equation (3.1-1). To do so, a self-written function called beta_ij() is used, which performs the regression for each of the three pollutants and retrieves the coefficients. If you are interested in the function, see the info box below. Recall that the regression estimates the effect of ERCOT wind production on pollution from fossil fuel power plants connected to the ERCOT or the SPP grid.

#< task
# These lines load data.rds and call 'beta_ij()'.

data = readRDS("data.rds")
beta = beta_ij(data)
#>

< info "'beta_ij()': A Function for Computing the Coefficients"

This self-written function is used to compute the $\beta_{i,j}$ for all three pollutants. It runs the regressions, retrieves the three sets of coefficients and saves them in a list.

beta_ij = function(data) {

  mgem_co2 = lm(txok_co2_tr ~
                  wind_tr + wind2_tr + wind3_tr +
                  wind_load_tr + wind2_load_tr + wind3_load_tr + 
                  wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                  wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                  load_tr + load2_tr + load3_tr +
                  spp_load_tr + spp_load2_tr + spp_load3_tr, data)
  beta_co2 = matrix(coef(mgem_co2)[2:13], nrow=3, ncol=4)

  mgem_nox = lm(txok_nox_tr ~
                  wind_tr + wind2_tr + wind3_tr +
                  wind_load_tr + wind2_load_tr + wind3_load_tr + 
                  wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                  wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                  load_tr + load2_tr + load3_tr +
                  spp_load_tr + spp_load2_tr + spp_load3_tr, data)
  beta_nox = matrix(coef(mgem_nox)[2:13], nrow=3, ncol=4)

  mgem_so2 = lm(txok_so2_tr ~
                  wind_tr + wind2_tr + wind3_tr +
                  wind_load_tr + wind2_load_tr + wind3_load_tr + 
                  wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                  wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                  load_tr + load2_tr + load3_tr +
                  spp_load_tr + spp_load2_tr + spp_load3_tr, data)
  beta_so2 = matrix(coef(mgem_so2)[2:13], nrow=3, ncol=4)

  beta = list("co2"=beta_co2, "nox"=beta_nox, "so2"=beta_so2)
  return(beta)
}

>
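
As a side remark, the three nearly identical regressions in beta_ij() could also be run in a single loop, building the formula with paste0() just as se_aea() further below does. Here is a compact sketch of this variant (not the function actually used in the problem set; beta_alt should contain the same matrices as beta_ij(data)):

pollutants = c("co2", "nox", "so2")
beta_alt = lapply(pollutants, function(p) {
  # Build the regression formula for pollutant 'p' and extract the twelve
  # wind coefficients as a 3 x 4 matrix.
  frml = formula(paste0("txok_", p, "_tr ~
                wind_tr + wind2_tr + wind3_tr +
                wind_load_tr + wind2_load_tr + wind3_load_tr +
                wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                load_tr + load2_tr + load3_tr +
                spp_load_tr + spp_load2_tr + spp_load3_tr"))
  matrix(coef(lm(frml, data))[2:13], nrow=3, ncol=4)
})
names(beta_alt) = pollutants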

Now we turn to computing $\widehat{AEA}(K)$ as defined in equation (4-3), first for CO$_2$. Following Novan, we will do the computations for the following capacity levels $K$: $0$, $1000$, $2000$, $3000$, $4000$, $5000$, $6000$, $7000$, $8000$, $9000$ and $10,000\,\text{MW}$ (see also the info box below). Run the next code chunk to compute $\widehat{AEA}(K)$ for CO$_2$ at the eleven capacity levels (we do not have to define aea() again because the function is provided to the problem set in a separate R-file).

#< task
# AEA(K) is computed for CO2 using 'aea()'. Then the result is shown.

aea_wind_co2 = aea((0:10)*1000, data$cap_factor, data$load, beta$co2)
aea_wind_co2
#>

< info "A Side Note on the Capacity Levels"

Please note that the level of $10,000\,\text{MW}$ slightly exceeds the actual maximum installed ERCOT wind capacity during our period of observation from 2007 to 2011. According to ERCOT (2017a), this amounts to $9,604\,\text{MW}$ by the end of 2011. So we are extrapolating slightly. Similarly, at the end of 2006 there were $2,875\,\text{MW}$ of ERCOT wind capacity installed, so we do not observe wind generation from marginal capacities below this level.

>

You can see how aea() produces a data frame with two columns, "K" and "AEA", the latter of which contains the final result for each level of capacity. The first $\text{MW}$ of installed wind capacity is estimated to offset on average $0.65\,\text{tons}$ of CO$_2$ per $\text{MWh}$ of wind generation. The average emissions avoided increase with capacity and reach $0.74\,\text{tons/MWh}$ for the $10,000$'th $\text{MW}$, which is an increase of $15\%$ compared to the first $\text{MW}$ of installed capacity. Before taking a closer look at the results, let's compute $\widehat{AEA}(K)$ in the same way for the other two pollutants, NO$_x$ and SO$_2$. Just click check on the code chunk below.

#< task
# Compute AEA(K) for NOx and SO2.

aea_wind_nox = aea((0:10)*1000, data$cap_factor, data$load, beta$nox)

aea_wind_so2 = aea((0:10)*1000, data$cap_factor, data$load, beta$so2)
#>

Run this code chunk to plot $\widehat{AEA}(K)$ for each pollutant.

#< task_notest
# This code chunk plots the estimates for AEA(K) for each pollutant.

# The standard errors of AEA(K) are computed using the self-written
# function 'se_aea()'.
se_aea_wind = se_aea((0:10)*1000, data$cap_factor, data$load, data)

# Add two new columns to 'aea_wind_co2' containing the lower and upper bound
# for the error area.
library(dplyr)
co2 = mutate(aea_wind_co2, se_co2_min = AEA-2*se_aea_wind[,2],
             se_co2_max = AEA+2*se_aea_wind[,2])

# Create the plot for CO2. 'geom_ribbon()' displays an interval around
# the estimates of AEA(K) representing two times their standard error. 
library(ggplot2)
plot_co2 = ggplot(co2,aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_co2_min, ymax=se_co2_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for CO2 [tons/MWh]")

# Create the plot for NOx.
nox = mutate(aea_wind_nox, se_nox_min = AEA-2*se_aea_wind[,3],
             se_nox_max = AEA+2*se_aea_wind[,3])

plot_nox = ggplot(nox, aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_nox_min, ymax=se_nox_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for NOx [lbs/MWh]")

# Create the plot for SO2.
so2 = mutate(aea_wind_so2, se_so2_min = AEA-2*se_aea_wind[,4],
             se_so2_max = AEA+2*se_aea_wind[,4])

plot_so2 = ggplot(so2, aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_so2_min, ymax=se_so2_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for SO2 [lbs/MWh]")

# These lines load the package 'gridExtra' and arrange the three plots next
# to each other using 'grid.arrange()'.
library(gridExtra)
grid.arrange(plot_co2, plot_nox, plot_so2, ncol = 3,
             top="Average Emissions Avoided by K'th MW of Wind Capacity")
#>

The black lines represent the estimates for $AEA(K)$ and the dark grey areas around them cover two times their standard errors both upwards and downwards. For these, Novan assumes that the estimate for the joint distribution function $\hat f (x_i,D_j)$ describes the true distribution, which means that only $\widehat{MEA}(x_i \cdot K, D_j)$ in (4-3) contains errors (cf. page 310 in Novan's paper). Essentially, only the $\beta_{i,j}$ have errors. The variance of $\widehat{AEA}(K)$ is then a weighted sum of the elements of the variance-covariance matrix of $\widehat{MEA}(x_i \cdot K, D_j)$, and the standard error is its square root.
Here the standard errors are computed using a self-written function called se_aea(). If you are interested in seeing the code, you can take a look at the info box below.
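In formula form, and treating $\hat{f}$ as fixed, $\widehat{AEA}(K)$ is a linear combination of the estimated coefficients $\hat{\beta}$, so its variance follows from the variance of a linear combination. A sketch of what se_aea() computes:

$$\widehat{\text{Var}}\left(\widehat{AEA}(K)\right)=\frac{w^{\top} \hat{\Sigma}_{\hat{\beta}} \, w}{\left(\sum_{i=1}^{40} \sum_{j=1}^{40} x_i \cdot \hat{f}(x_i,D_j)\right)^{2}} \qquad \text{with} \qquad w=\sum_{i=1}^{40} \sum_{j=1}^{40} x_i \cdot \hat{f}(x_i,D_j) \cdot \nabla_{\hat{\beta}} \widehat{MEA}(x_i \cdot K,D_j),$$

where $\hat{\Sigma}_{\hat{\beta}}$ is the Newey-West variance-covariance matrix of the twelve wind coefficients and the standard error is the square root of this expression.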
The plots show that the average avoided emissions from the $K$'th $\text{MW}$ of installed wind capacity are estimated to increase in $K$ for all three pollutants. This can be explained by recalling what we found in exercise 2.3. There we learned that wind output and demand in Texas are negatively correlated and that wind production tends to be highest in hours with low demand. These hours therefore carry the most weight in the effect of wind production on pollution. And from computing and plotting the marginal emissions avoided in exercise 3.2, we know that the offset emissions are in fact increasing with wind output during low-demand hours.
The results for CO$_2$ were already discussed above. The first unit of wind capacity offsets on average about $0.95\,\text{lbs}$ of NO$_x$ per $\text{MWh}$ and nearly $1.5\,\text{lbs}$ of SO$_2$ per $\text{MWh}$ of generation, while the $10,000$'th $\text{MW}$ accounts for $1.15\,\text{lbs}$ of avoided NO$_x$ emissions and nearly $2.25\,\text{lbs}$ of avoided SO$_2$ emissions per $\text{MWh}$. This corresponds to an increase of about $20\%$ for NO$_x$ and more than $50\%$ for SO$_2$ compared to the first $\text{MW}$.
What stands out in all three graphs is that the standard error is larger for low and high levels of installed wind capacity than it is for medium levels. A likely reason is that during our period of observation there were between about $2,900\,\text{MW}$ and $9,600\,\text{MW}$ of wind capacity installed (ERCOT (2017a)). So we are extrapolating for capacities outside this range, which might make the estimates less precise.
The plots also reveal that, within the considered range, the increase of $\widehat{AEA}(K)$ for CO$_2$ accelerates with increasing $K$, while it slows down for NO$_x$. For SO$_2$, it increases at a fairly constant rate.

< info "'se_aea()': Computing the Standard Errors for AEA(K)"

This is the self-written function used to compute the standard errors for $\widehat{AEA}(K)$.

se_aea = function(K, xdata, Ddata, regdata, xbin=40, ybin=40) {

  library(GenKern)
  dist_est = KernSur(xdata, Ddata, xbin, ybin)

  # Arrange the bin midpoints and densities as column vectors of length
  # xbin*ybin, with 'x' varying slowest and 'D' fastest.
  x = matrix(rep(dist_est$xords, each = ybin))
  D = matrix(rep(dist_est$yords, times = xbin))
  f = matrix(t(dist_est$zden), ncol=1)

  for (k in K) {

    # Matrix of (up to sign) derivatives of MEA(x*k, D) with respect to the
    # twelve coefficients: one row per grid point, one column per coefficient,
    # ordered as in 'coef(mgem)[2:13]'.
    weights = matrix(nrow=xbin*ybin, ncol=12)

    for (i in 1:(xbin*ybin)) {
      for (m in 0:3) {
        for (n in 1:3) {
          weights[i,m*3+n]=n*(k*x[i])^(n-1)*D[i]^m
        }
      }
    }

    name = paste("weightsmea_", k, sep="")
    assign(name, weights)
  }



  for (i in c("co2", "nox", "so2")) {

    frml = formula(paste("txok_",i,"_tr ~
                  wind_tr + wind2_tr + wind3_tr +
                  wind_load_tr + wind2_load_tr + wind3_load_tr + 
                  wind_load2_tr + wind2_load2_tr + wind3_load2_tr +
                  wind_load3_tr + wind2_load3_tr + wind3_load3_tr +
                  load_tr + load2_tr + load3_tr +
                  spp_load_tr + spp_load2_tr + spp_load3_tr", sep=""))
    mgem = lm(frml, regdata)

    # Newey-West variance-covariance matrix (lag of 24 hours) for the twelve
    # wind coefficients, rescaled by a degrees-of-freedom correction.
    library(sandwich)
    vcov_mgem = NeweyWest(mgem,lag=24,prewhite=FALSE)[2:13,2:13] *
      (43794-18-1)/(43794-18-1825-9660)

    name = paste("vcov_mgem_", i, sep="")
    assign(name, vcov_mgem)
  }


  df = data.frame(K=K, co2=rep(NA, times=length(K)), nox=rep(NA, times=length(K)),
                  so2=rep(NA, times=length(K)))  

  for (i in c("co2", "nox", "so2")) {
    for (k in K) {

      weights = get(paste("weightsmea_", k, sep=""))
      vcov_mgem = get(paste("vcov_mgem_", i, sep=""))

      # Standard error of AEA(K): the square root of the quadratic form
      # w' Cov(beta) w, with w = t(x*f) %*% weights, divided by the squared
      # denominator of (4-3).
      se = sqrt(t(x*f) %*% weights %*% vcov_mgem %*% t(weights) %*% (x*f) /
                  (t(x) %*% f)^2)
      df[which(df[,1] == k),which(colnames(df)==i)] = se[1,1]

    }
  }

  return(df)
}

>

b) Social Benefits from Wind Energy

Now that we know what short-run impact the marginal $\text{MW}$ of ERCOT wind capacity has on average emissions from fossil fuel power plants in Texas and Oklahoma, we can finally turn to the crucial step on our way to judging renewable energy policies: computing the social benefit of wind power. In other words, we will express the average emissions avoided in monetary terms. If you are not familiar with the term "social benefit" or if you could do with a little recap, take a look at the info box below.

< info "Externalities"

Cost or benefits of an action, which are not borne by the causer, are called externalities. A positive externality has a positive effect and a negative externality has a negative effect on someone who is not involved in the action causing this effect. This is also known as social (or external) benefit and cost, respectively.
In our case the operators of fossil fuel power plants do not (fully) bear the cost associated with negative effects of the emitted pollutants on, for example, the environment or the health of humans and animals. This cost ultimately has to be carried by society. This means that both power plant operators and the direct consumers do not consider these costs in their calculations. If they did, producing electricity would become more expensive and therefore less profitable and operators would reduce the emission of pollutants by either cutting down production or applying methods reducing emissions. The opposite applies to the operators of wind turbines and other renewable energy power plants.
The problem with externalities is that they make a market inefficient. In exercise 7, we will discuss how to counteract the social costs of emissions from fossil fuel power plants effectively.

(Jaeger (2005))

>

The first things we need in order to compute the social benefits from wind energy are estimates for the cost of pollution.
For the social cost of CO$_2$ emissions, Novan uses an estimate from the Interagency Working Group on Social Cost of Carbon (IAWG), assuming an annual discount rate of $3\%$. IAWG (2013) proposes a social cost for CO$_2$ of \$$32$ per ton (in 2010 dollars, see the info box below for more information on the estimates). CO$_2$ is not regulated in Texas and Oklahoma (cf. RAP (2010)). Therefore, reducing emissions in one period of time does not affect emissions in subsequent periods or in a different region.

< info "Estimates for the Social Cost of Pollutants"

CO$_2$: The IAWG estimates the monetary value of damages due to CO$_2$ based on three models for different socio-economic scenarios (IAWG (2013)). The models include, amongst others, the impact of CO$_2$ on agriculture, the ecosystem and health.
Please note that the estimate of \$$33$ at a $3\%$ discount rate in IAWG (2013) is reported in 2007 dollars per metric ton, while Novan uses 2010 dollars per short ton. Converting from metric to short ton and adjusting for the cumulative inflation from 2007 to 2010 yields a rounded estimate of \$$32/\text{ton}$.

NO$_x$ and SO$_2$: Banzhaf & Chupp (2012) estimate the social costs for NO$_x$ and SO$_2$ for each US state using the Tracking and Analysis Framework (TAF) integrated assessment model. It combines various modules from different research areas to judge the pollutants' impact on health, taking into account the location of both the emission source and the resulting damage.
Note that they do not present detailed results for NO$_x$, but state that the values are provided when requested. I assume that is what Novan has done and rely on the value he reports in his Stata code, which is available with the original dataset at Novan (2015b). This value (\$$528.409/\text{ton}$) differs from the one Novan states in his paper (\$$548/\text{ton}$). But computing the values for the hourly external costs in the higher-impact scenario, yields the same result as Novan's values in his original dataset when using \$$528.409/\text{ton}$.

Alternative Estimates: Estimating the social costs of pollutants is difficult and there is an abundance of estimates based on different models and for different scenarios. These estimates vary substantially, as Tol (2008) shows for CO$_2$. He conducts a meta-analysis of 211 estimates for the social cost of CO$_2$, published between 1982 and 2006. Over all estimates from peer-reviewed studies, the mean is \$$21/\text{ton}$ at a discount rate of $3\%$. In a more recent study, Nordhaus (2016) finds an estimate of \$$31/\text{ton}$ in 2015 for the same discount rate.
There are fewer studies focusing on the other two pollutants. Alternative estimates can be found in Muller & Mendelsohn (2009). They estimate the expected marginal damage from NO$_x$ emissions in the US at \$$260/\text{ton}$ per year and that of SO$_2$ at \$$1,310$ per $\text{ton}$ and year.

>

The situation is less clear for NO$_x$ and SO$_2$ because the emission of these two pollutants in Texas is subject to the Clean Air Interstate Rule (CAIR) and the Acid Rain Program (ARP) during our period of observation (EPA (2014)). Both are cap and trade programmes, which means that there is a limit to the quantity of pollutants which power plants may emit. The cap is ensured by issuing a given number of allowances, granting the holder the right to emit a certain amount of the pollutant. The allowances can be traded between polluters. More information on CAIR and ARP can be found in the info box below.
If the caps set by CAIR and ARP are binding, wind energy will still result in contemporaneous reductions of NO$_x$ and SO$_2$ emissions. But with the then unused allowances, operators of fossil fuel plants can instead emit more pollutants at a later date or at a different place, making it very difficult to judge the impact of wind energy on these two pollutants. However, Novan assumes in his paper that the caps are non-binding (the info box "Are CAIR and ARP caps binding?" below explains why this is a reasonable assumption).
Novan uses estimates by Banzhaf & Chupp (2012) for the social costs of NO$_x$ and SO$_2$. They find values of \$$528.409/\text{ton}$ for NO$_x$ and \$$3,194.46/\text{ton}$ for SO$_2$. More detailed information on the estimates as well as alternative estimates are in the info box above.

< info "CAIR and ARP"

CAIR: The Clean Air Interstate Rule (CAIR) was a cap and trade programme by the United States Environmental Protection Agency (EPA) which ran from 2009 until the end of 2014. It consisted of three parts: an NO$_x$ ozone season program, which limited the amount of NO$_x$ emitted during the summer months when ozone pollution is highest, and annual NO$_x$ and SO$_2$ programmes. The two programmes regulating NO$_x$ began in 2009 while the SO$_2$ programme followed in 2010. Texas was affected by the two annual programmes. On January 1, 2015 CAIR was replaced by the Cross-State Air Pollution Rule (CSAPR).

ARP: The Acid Rain Program (ARP) began in 1995 and was extended in 2000. It then regulated the emission of NO$_x$ and SO$_2$ from coal-, gas- and oil-fired power plants. In 2010, the final annual limit for the SO$_2$ cap and trade programme was set to $8.95\,\text{million tons}$. NO$_x$ emissions are not regulated by an allowance trading system, but by setting specific limits on some of the coal-fired units.

(EPA (2014))

>

< info "Are CAIR and ARP Caps Binding?"

Novan considers the NO$_x$ and SO$_2$ caps from CAIR and ARP as non-binding. He justifies this assumption with two points.
For one, he argues that the prices for NO$_x$ and SO$_2$ allowances are very low. In fact, the prices for CAIR allowances continually decrease over the observation period: from almost \$$800/\text{ton}$ for NO$_x$ and over \$$500/\text{ton}$ for SO$_2$ in 2007 to \$$15.89/\text{ton}$ for NO$_x$ and \$$2.12/\text{ton}$ for SO$_2$ in 2011 (EIA (2012)). This is only a fraction of the estimated social costs by Banzhaf & Chupp (2012). The big drop in prices suggests that more and more allowances are unused and offered on the allowance market, making the price for them decrease.
This is additionally supported by Novan's second observation, that total emissions are lower than the cap levels. Indeed, SO$_2$ emissions in 2011 are below the ARP cap by $4.41\,\text{million tons}$ (EPA (2011)). They are higher than the CAIR cap by $940,000\,\text{tons}$, but power plant operators can cover the CAIR requirement with the excess ARP allowances. In the same year, NO$_x$ emissions are below the CAIR cap by about $146,000\,\text{tons}$. All of this suggests that both CAIR and ARP caps are not binding.

>

Novan estimates the social benefits from wind energy for two scenarios. In the first one, the caps for NO$_x$ and SO$_2$ are assumed to be binding, which means that wind generation only offsets CO$_2$ emissions. The second one assumes that the caps are non-binding and all three pollutants are affected.
For each scenario, Novan calculates the hourly external cost of pollution by multiplying the hourly emission for the respective pollutants with their corresponding social cost. To compute estimates of the average external benefit from one $\text{MWh}$ generated by the $K$'th $\text{MW}$ of wind capacity, he basically uses the same approach as for the average emissions avoided $AEA(K)$. He only replaces the dependent variable in the regression model (3.1-1) with the external cost of pollution (cf. Novan's Stata code available at Novan (2015b)). The marginal emissions avoided in (3.1-2) now represent marginal external benefits and the average emissions avoided in equation (4-2) are average external benefits.

We take a shorter, but equivalent, route. For each capacity level, we use the results for $\widehat{AEA}(K)$, which we computed above, and multiply those with the pollutants' social costs.
The code below assigns the pollutants' social costs to three new variables called sc_co2, sc_nox and sc_so2. Run the chunk.

#< task
# The social costs of the pollutants are stored in 'sc_co2', 'sc_nox'
# and 'sc_so2'.

sc_co2 = 32
sc_nox = 528.409
sc_so2 = 3194.46
#>

The next code chunk computes the average external benefits for the first scenario, in which only CO$_2$ is offset in the long run. Just press check.

#< task
# Calculate the average external benefits for the lower-impact scenario.

aeb_wind_low = sc_co2*aea_wind_co2[,2]
#>

Now let's do the same for the higher-impact scenario, in which all three pollutants are offset in the long run and which Novan assumes is more likely. Run the chunk to compute the total social benefit from avoiding CO$_2$ as well as NO$_x$ and SO$_2$. Note that the social costs are given in \$$/\text{ton}$, but $\widehat{AEA}(K)$ is in $\text{lbs/MWh}$ for NO$_x$ and SO$_2$; since $1\,\text{ton} = 2000\,\text{lbs}$, these two values are divided by $2000$.
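
As a small worked example for the conversion: the roughly $1.5\,\text{lbs}$ of SO$_2$ that the first $\text{MW}$ of wind capacity avoids per $\text{MWh}$ correspond to $1.5/2000 = 0.00075\,\text{tons/MWh}$, which at a social cost of \$$3,194.46/\text{ton}$ is worth about $0.00075 \cdot 3194.46 \approx \$2.40$ per $\text{MWh}$ of wind generation.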

#< task
# Calculate the average external benefits for the higher-impact scenario.

aeb_wind_high = sc_co2*aea_wind_co2[,2] + sc_nox*aea_wind_nox[,2]/2000 +
  sc_so2*aea_wind_so2[,2]/2000
#>
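
If you want to inspect the numbers before plotting them, you can optionally print both series side by side (a small sketch; the plotting chunk below builds the same data frame):

data.frame(K = aea_wind_co2$K, aeb_low = aeb_wind_low, aeb_high = aeb_wind_high)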

< award "Master of Social Benefits"

You learned how to estimate the social benefits of wind power and are now able to interpret the results. Well done!

>

Run the next code chunk to draw two graphs showing the results.

#< task_notest
# This code chunk plots the estimates for the social benefits for both scenarios.

# These commands create a data frame with the capacity levels, the results for
# both scenarios and the lower and upper bound for their error areas.
aeb = data.frame(K = aea_wind_co2$K,
                 aeb_low = aeb_wind_low,
                 aeb_high = aeb_wind_high)

aeb_tot = mutate(aeb,
                 se_low_min = aeb_low - 2*se_aea_wind[,2]*sc_co2,
                 se_low_max = aeb_low + 2*se_aea_wind[,2]*sc_co2,
                 se_high_min = aeb_high -
                   2*(se_aea_wind[,2]*sc_co2+se_aea_wind[,3]*sc_nox/2000+
                        se_aea_wind[,4]*sc_so2/2000),
                 se_high_max = aeb_high +
                   2*(se_aea_wind[,2]*sc_co2+se_aea_wind[,3]*sc_nox/2000+
                        se_aea_wind[,4]*sc_so2/2000))

# These commands create the two plots.
plot_low = ggplot(aeb_tot, aes(x=K, y=aeb_low)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_low_min, ymax=se_low_max), alpha=0.3) +
  xlab("K [MW]") +
  ylab("Social Benefit of Wind [$/MWh] - Only CO2 Offset") +
  scale_y_continuous(limits = c(16,32))

plot_high = ggplot(aeb_tot, aes(x=K, y=aeb_high)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_high_min, ymax=se_high_max), alpha=0.3) +
  xlab("K [MW]") +
  ylab("Social Benefit of Wind [$/MWh] - CO2, NOx, SO2 Offset") +
  scale_y_continuous(limits = c(16,32))

# This command arranges the two plots next to each other for easier comparison.
grid.arrange(plot_low, plot_high, ncol = 2, 
             top="Average Social Benefit from K'th MW of Wind Capacity")
#>

The black lines show the social benefit as a function of the wind capacity level. The dark grey areas around them again represent two times the standard error both upwards and downwards. The standard error for the social benefits is computed as the weighted sum of the standard errors of $\widehat{AEA}(K)$.
Both graphs mostly follow the behaviour of the average emissions avoided for CO$_2$. The social benefit is increasing in $K$ and the growth rate becomes larger with the level of capacity. This means that, within the considered range, each additional unit of installed wind capacity has a larger social benefit than the one before, and this increment itself grows with $K$. For the lower-impact scenario, this is of course due to the fact that the social benefit is computed by simply multiplying the results for the AEA with a constant factor. But even when all three pollutants are avoided in the long run, CO$_2$ is still the most important influence. The emission of NO$_x$ and SO$_2$ is associated with higher social costs, but their emission intensities are much lower than that of CO$_2$. We also find again that the standard errors are larger for the lowest and highest considered capacity levels.
In the lower-impact scenario, the first $\text{MW}$ of installed wind capacity has an estimated average social benefit of about \$$21/\text{MWh}$ while the $10,000$'th $\text{MW}$ reaches close to \$$24/\text{MWh}$. This constitutes an increase of about $14\%$.
As is to be expected, the values for the higher-impact scenario are consistently higher than those for the lower-impact scenario. We get an estimate of over \$$23/\text{MWh}$ for the average social benefit from the first unit of ERCOT wind capacity. It grows by about $20\%$ to more than \$$27.5/\text{MWh}$ for the marginal unit of capacity at $10,000\,\text{MW}$. By way of comparison: the average electricity wholesale price in Texas during the years 2007 to 2011 was \$$52/\text{MWh}$ (ERCOT (2017a)).

The estimates for both the average emissions avoided and the social benefits show what we have already found for the marginal emissions avoided in exercise 3.2: wind energy does not have the same effect on pollution in every situation. Here we see that how much of each pollutant is offset, and therefore what the social benefit of an additional $\text{MW}$ of wind capacity is, depends on the level of already installed wind capacity. Again we find heterogeneity in the impact of wind energy.
As already mentioned in exercise 1.1, fossil fuel power plants do not only emit CO$_2$, NO$_x$ and SO$_2$, but also other pollutants (for example other greenhouse gases or small particles). Novan points out that these estimates therefore most likely present a lower bound to the true social benefits from ERCOT wind power.

In this exercise, we computed the average emissions avoided from one $\text{MWh}$ of electricity generated by the marginal ERCOT wind capacity unit. We then translated these estimates into the social benefits that are offered through wind energy by offsetting these emissions. We learned that these grow with the level of installed capacity - they are heterogeneous in $K$.
The next two exercises will move away from wind power and turn instead to another renewable technology: solar energy. There we will perform similar computations as for wind energy and demonstrate that the social benefits from solar power differ from those of wind power.

Exercise 6.1 explores the characteristics of solar power. If you are only interested in analysing its social benefits, you may skip this exercise and continue with exercise 6.2.

Exercise 6.1 -- Solar Power: Characteristics of Solar Energy

An important topic in Novan's paper is heterogeneity. In exercises 3.2 and 5, we saw how the impact of wind energy in Texas on pollution from fossil fuel-fired generators varies with the level of installed wind capacity and demand. This means that not each new wind facility has the same impact on pollution, and even for one wind turbine, the effect can vary over time.
However, Novan does not only find heterogeneity within the same renewable technology, but also across technologies. Different renewable technologies can therefore not be treated as if they were essentially one. We will now explore this heterogeneity across technologies by estimating the external benefits for a second renewable energy source - solar power - and comparing it to the findings for wind energy. This will allow us to get a more comprehensive view of the impact of renewable energy in the ERCOT market on pollution. Based on these results, we will discuss renewable energy policies in exercise 7.

Before we begin with the analysis, let's find out more about the characteristics of energy production from solar photovoltaic (PV) panels in Texas. This will give you a better understanding of solar power and how it differs from wind power. As with wind power, we will examine the impact of solar energy in the ERCOT grid on emissions of CO$_2$, NO$_x$ and SO$_2$ from fossil fuel power plants in Texas and Oklahoma. Novan uses hourly data on solar production from 2012 provided by ERCOT (cf. section II in Novan's paper). Run the code chunk to load the dataset data_solar.rds. Besides the familiar variables datetime, month, day, hour and load, you will also find solar, which is the total hourly generation from ERCOT solar photovoltaic panels (in $\text{MWh}$), and solar_cap_factor, which is their hourly capacity factor.

#< task
# Load the dataset 'data_solar.rds' and store it in 'data_solar'.

data_solar = readRDS("data_solar.rds")
#>
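
If you like, you can take a quick optional look at the new variables by printing the first rows:

head(data_solar[, c("datetime", "load", "solar", "solar_cap_factor")])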

Let's first take a look at the fluctuations in solar generation, more precisely in the capacity factor for solar generation. Recall that the capacity factor describes the utilisation of a power plant: actual generation divided by the maximum output the installed capacity could have delivered over the same period.


< quiz "Generation vs Capacity Factor"

question: What is the advantage of using the capacity factor instead of absolute generation when plotting solar generation over longer periods of time?

sc:
- Plotting the capacity factor requires less data preparation compared to plotting absolute generation.
- The data for the capacity factor is more reliable than that for absolute generation.
- We do not have to worry about capacity additions when plotting the capacity factor.*

success: Great, your answer is correct. According to ERCOT (2017a), ERCOT utility-scale solar capacity (i.e. excluding rooftop PV panels and microgrids) nearly doubled in 2012, from 42 MW at the end of 2011 to 82 MW by the end of 2012. When plotting ERCOT solar generation over time, capacity additions can cause sudden jumps in the graph. Since the capacity factor measures the relative utilisation of the available solar capacity and we assume the same capacity factor for each solar unit, it is independent of the total capacity. (Note that we have the same problem when plotting wind generation in exercise 1.2. However, I chose a year with low relative wind capacity additions, which are therefore of little consequence.)
failure: Your answer is not correct. Think about what would make it more difficult to draw conclusions from the plot.

>


< quiz "Plotting Solar"

question: What do you expect to see when plotting the capacity factor? Multiple answers are correct.

mc:
- There is no solar generation at night.*
- There is only little solar generation at night.
- Solar production is highest during winter.
- Solar production is highest during summer.*

success: Great, your answer is correct. We will see that solar PV panels do not produce electricity when there is no sunlight. And they are most productive during the summer months when the solar radiation is highest.
failure: Your answer is not correct. Try again.

>


First, we plot the hourly solar capacity factor during October 2012. The procedure is the same as in exercise 1.2 when we created a similar plot for wind generation. Just press check.

#< task_notest
# These lines load the package 'dplyr' and create a subset of 'data_solar'
# containing the data for October only.
library(dplyr)
data_solar_oct = filter(data_solar, month==10)

# These lines load the package 'ggplot2' and plot 'solar_cap_factor' against
# 'datetime'.
library(ggplot2)
ggplot(data_solar_oct, aes(x=datetime, y=solar_cap_factor)) +
  geom_line() +
  xlab("Hour") +
  ylab("Hourly ERCOT Solar Capacity Factor") +
  ggtitle("Variation of Hourly Output From Solar PV Panels in Texas") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_datetime(date_breaks = "2 days", date_labels = "%b %d")
#>

The graph shows clearly that there is no solar generation during the night. On most days in October, the capacity factor steadily increases until it reaches its maximum in the middle of the day, after which it decreases again. The capacity factor peaks at well over $80\%$ on about half of the days while there are also days with comparatively little solar production. This already shows a difference to wind energy. The plot in exercise 1.2 revealed that wind turbines tend to produce most energy during the night.

Next, we will create a graph which plots the daily mean solar capacity factor for each day in 2012. Again, the procedure is similar to the one in exercise 1.2, so just press check on the code chunk.

#< task_notest
# First, the daily average capacity factor is computed and saved in a new
# data frame.
data_solar_daily_1 = data_solar %>%
  group_by(month, day) %>%
  summarise(solar_avg = mean(solar_cap_factor)) %>%
  ungroup()

# Then a column 'date' is added to the new data frame.
library(lubridate)
data_solar_daily_2 = mutate(data_solar_daily_1,
                          date=make_date(year=2012, month=month, day=day))

# Finally, the plot is created.
ggplot(data_solar_daily_2, aes(x=date, y=solar_avg)) +
  geom_line() +
  xlab("Day") +
  ylab("Daily Average ERCOT Solar Capacity Factor") +
  ggtitle("Variation of Daily Average Output From Solar PV Panels in Texas") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_date(date_breaks = "month", date_labels = "%b %d", minor_breaks = NULL)
#>

The graph shows how the daily average of the solar capacity factor varies over the course of the year. The variation can be substantial even from one day to the next. However, mean production tends to be higher during summer than during winter. That is to be expected because during summer the days are longer and the solar radiation is higher than during the winter months. This is again the opposite of what we found for wind generation in exercise 1.2, where output tends to be lowest in summer.

Finally, let's check the correlation between solar generation and demand. Run the next code chunk to create two plots showing the hourly ERCOT demand in October and the daily average demand for 2012.

#< task_notest
# Create the plot for the hourly demand in October.
plot_hourly = ggplot(data_solar_oct, aes(x=datetime, y=load)) +
  geom_line() +
  xlab("Hour") +
  ylab("Hourly ERCOT Demand [MWh]") +
  scale_x_datetime(date_breaks = "2 days", date_labels = "%b %d")

# Calculate the daily mean demand for each day in 2012 and save it in a new
# data frame 'data_demand_daily_1'.
data_demand_daily_1 = data_solar %>%
  group_by(month, day) %>%
  summarise(demand_avg = mean(load)) %>%
  ungroup()

# Add a column 'date' to 'data_demand_daily_1'.
data_demand_daily_2 = mutate(data_demand_daily_1,
                           date=make_date(year=2012, month=month, day=day))

# Create the plot for the daily average demand in 2012.
plot_daily = ggplot(data_demand_daily_2, aes(x=date, y=demand_avg)) +
  geom_line() +
  xlab("Day") +
  ylab("Average Daily ERCOT Demand [MWh]") +
  scale_x_date(date_breaks = "month", date_labels = "%b %d", minor_breaks = NULL)

# Arrange the two plots on top of each other.
library(gridExtra)
grid.arrange(plot_hourly, plot_daily, nrow = 2,
             top="Variation of Demand in Texas")
#>

< quiz "Correlation of Solar and Demand"

question: Based on these two plots, and keeping in mind what we just learned about the variation of solar generation, how do you expect ERCOT demand and the solar capacity factor to be correlated?

sc:
- I expect them to be uncorrelated.
- I expect them to be positively correlated.*
- I expect them to be negatively correlated.

success: Great, that is indeed what we will find. The plots show that ERCOT demand peaks during daytime and is lowest during the early hours of the day. It also tends to be highest during the summer months. This is similar to what we found for the solar capacity factor.
failure: This is not what we will find. Take another look at all four plots.

>


Now let's check whether our expectation is correct. Compute the correlation between load and solar_cap_factor.

#< task
# Compute the correlation between 'load' and 'solar_cap_factor'.
#>

cor(data_solar$load, data_solar$solar_cap_factor)
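
For comparison (optional): if you load the wind dataset again, you can verify that the corresponding correlation for wind is negative, as we found in exercise 2.3.

data = readRDS("data.rds")        # the wind dataset from the earlier exercises
cor(data$load, data$cap_factor)   # negative, in contrast to solar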

< award "Solar Power Expert"

You learned about the characteristics of solar power and how they compare to wind power and electricity demand. This is important for interpreting our results in exercise 6.2.

>

ERCOT demand and the capacity factor of solar PV panels are indeed positively correlated. So the PV panels generate most electricity during hours of higher demand. This is different to the relation between wind generation and demand: in exercise 2.3, we found that those two are negatively correlated. This already suggests that the impact of solar power on pollution might differ from that of wind power.

We now have a better understanding of solar power. We will need this in the next exercise, where we will compute and analyse the average emissions avoided and the associated social benefits from solar generation in Texas.

Exercise 6.2 -- Solar Power: Social Benefits from Solar Energy

a) Average Emissions Avoided from Solar Power

To be able to compare the social benefits from wind and solar power, we will first estimate the average emissions avoided $AEA(K)$ from solar generation for each of the three pollutants. The equations are basically the same as before for wind, but wind output is replaced by solar output (go back to exercises 3.1 and 4 if you need to call the equations to mind).
Unfortunately, the amount of installed solar capacity in Texas is very low. We already saw this in exercise 1.2, where we plotted the average share in total production for different fuel types. The "other" category, which included solar generation, had only a minor contribution in the years 2007 to 2011. According to ERCOT (2017a), utility-scale solar capacity in the ERCOT grid amounted to $82\,\text{MW}$ by the end of 2012. This poses a problem because, as Novan points out, solar generation is not high enough for us to get reliable estimates in the regressions of the quantity of emitted pollutants on solar generation in equation (3.1-1). These are required for the computation of the marginal emissions avoided.
Therefore, Novan assumes that the MEA for solar and wind generation are the same, i.e. he assumes that the marginal $\text{MWh}$ of solar output avoids the same amount of pollutants as that of wind output at given levels of demand and wind or solar generation. This means that we can use the estimates of $MEA(x\cdot K,D)$, which we used in exercise 5 for wind. Read through the info box below for a few remarks on the validity of this assumption.

< info "MEA for Solar Energy"

Novan assumes that the marginal $\text{MWh}$ of solar power has the same effect on pollutants as wind energy for given levels of demand and wind or solar generation. The author supports this assumption with two observations. For one, he compares quarter-hourly data on ERCOT solar and wind generation in 2012 and finds them to be similarly fluctuating (cf. page 314 in Novan's paper). Furthermore, he argues that the northwest of Texas, where most of the wind capacity is installed, has the highest potential for solar energy, making it probable that solar PV panels are installed there as well.
According to more recent data, this did not materialise. The figure below shows that solar PV panels are mostly installed in the southeast of Texas. This means Novan's assumption might not be valid and could be a source of imprecision in our results. However, computing the marginal emissions avoided with the little data we have also gives imprecise results. Therefore, we will still follow Novan's approach and use $MEA_{\text{wind}}$ for estimating the effect of solar energy. The resulting estimates can be improved in the future by computing $MEA_{\text{solar}}$ from solar data once the installed solar capacity is higher.

Locations of Wind and Solar Power Plants in Texas
(Source: EIA (2017f))

>

We have to run the regression in (3.1-1) for wind generation once more to extract the coefficients $\beta_{i,j}$ which we will then use when computing $\widehat{AEA}(K)$ for solar generation. This is again done with the self-written function beta_ij(). Run the next code chunk.

#< task
# These lines load data.rds and call 'beta_ij()'.

data = readRDS("data.rds")
beta = beta_ij(data)
#>

Run the next code chunk to compute the average emissions avoided by solar energy for each of the three pollutants using our function aea(). Again, we will follow Novan's example and choose $0$, $1000$, $2000$ and $3000\,\text{MW}$ as capacity levels. The data for the joint distribution estimate $\hat f(x_{\text{solar}},D)$ is solar_cap_factor from the dataset data_solar.rds and, just like before, load. We are still using $40$ bins in each dimension, but since we use data for one year only, we of course have fewer data points than we had for wind power.

#< task
# Load 'data_solar.rds' and estimate AEA(K) for each pollutant using 'aea()'.

data_solar = readRDS("data_solar.rds")

aea_solar_co2 = aea((0:3)*1000, data_solar$solar_cap_factor, data_solar$load, beta$co2)
aea_solar_nox = aea((0:3)*1000, data_solar$solar_cap_factor, data_solar$load, beta$nox)
aea_solar_so2 = aea((0:3)*1000, data_solar$solar_cap_factor, data_solar$load, beta$so2)
#>

< quiz "AEA for Solar Energy"

question: How do you expect AEA(K) to change when the level of installed solar capacity increases? For your answer, think about at which levels of demand PV panels produce most (we explored this in exercise 6.1). Then recall how the MEA change for different levels of renewable production at these demand levels. Check the plots in exercise 3.2 again if you need help.

sc:
- I expect it to be constant for all levels of solar capacity.
- I expect it to increase with increasing levels of solar capacity.
- I expect it to decrease with increasing levels of solar capacity.*

success: Great, this is exactly what we will see when plotting the results.
failure: This is not what we will see when plotting the results. Try again.

>


Run this code chunk to plot $\widehat{AEA}(K)$ for solar generation for each pollutant.

#< task_notest
# This code chunk plots the estimates for AEA(K) from solar power for each
# pollutant.

# The standard errors of AEA(K) are computed using the self-written function
# 'se_aea()'.
se_aea_solar = se_aea((0:3)*1000, data_solar$solar_cap_factor,
                      data_solar$load, data)

# Add two new columns to 'aea_solar_co2' containing the lower and upper bound
# for the error area.
library(dplyr)
co2 = mutate(aea_solar_co2, se_co2_min = AEA-2*se_aea_solar[,2],
             se_co2_max = AEA+2*se_aea_solar[,2])

# Create the plot for CO2. 'geom_ribbon()' displays an interval around
# the estimates of AEA(K) representing two times their standard error. 
library(ggplot2)
plot_co2 = ggplot(co2,aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_co2_min, ymax=se_co2_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for CO2 [tons/MWh]")

# Create the plot for NOx.
nox = mutate(aea_solar_nox, se_nox_min = AEA-2*se_aea_solar[,3],
             se_nox_max = AEA+2*se_aea_solar[,3])

plot_nox = ggplot(nox, aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_nox_min, ymax=se_nox_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for NOx [lbs/MWh]")

# Create the plot for SO2.
so2 = mutate(aea_solar_so2, se_so2_min = AEA-2*se_aea_solar[,4],
             se_so2_max = AEA+2*se_aea_solar[,4])

plot_so2 = ggplot(so2, aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_so2_min, ymax=se_so2_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for SO2 [lbs/MWh]") +
  scale_y_continuous(breaks = (2:9)*0.2)


# These lines load the package 'gridExtra' and arrange the three plots next to
# each other.
library(gridExtra)
grid.arrange(plot_co2, plot_nox, plot_so2, ncol = 3, 
             top="Average Emissions Avoided by K'th MW of Solar Capacity")
#>

When interpreting the results for solar power, be aware that, because the installed solar capacity at the end of 2012 was only $82\,\text{MW}$, these values are predictions of the impact of solar capacity. The graphs show that, unlike the average emissions avoided from wind capacity, the impact of solar production on emissions of all three pollutants is estimated to decrease with increasing installed capacity. This can be explained by going back to the plots for the marginal emissions avoided in exercise 3.2. These show that the offset emissions decrease with increasing marginal renewable production for high levels of demand. And hours with high levels of demand are exactly when solar PV panels tend to produce most electricity, as we learned in the last exercise.
The first $\text{MW}$ of installed solar capacity is estimated to reduce emissions on average by more than $0.65\,\text{tons}$ of CO$_2$, $1.60\,\text{lbs}$ of NO$_x$ and about $1.05\,\text{lbs}$ of SO$_2$ per $\text{MWh}$. The impact is predicted to decrease to less than $0.60\,\text{tons/MWh}$ for CO$_2$, below $1.30\,\text{lbs/MWh}$ for NO$_x$ and about $0.75\,\text{lbs/MWh}$ for SO$_2$ for the $3,000$'th $\text{MW}$ of installed solar capacity. This corresponds to a drop by about $10\%$ for CO$_2$, $19\%$ for NO$_x$ and $29\%$ for SO$_2$.

To check whether this downward trend is predicted to persist even for higher levels of installed solar capacity, let's compute and plot $\widehat{AEA}(K)$ for levels up to $10,000\,\text{MW}$ of solar capacity. Run the code chunk below.

#< task_notest
# This code chunk plots the estimates for AEA(K) from solar power for each 
# pollutant up to capacity levels of 10,000 MW.

aea_solar_co2_high = aea((0:10)*1000, data_solar$solar_cap_factor, data_solar$load, beta$co2)
aea_solar_nox_high = aea((0:10)*1000, data_solar$solar_cap_factor, data_solar$load, beta$nox)
aea_solar_so2_high = aea((0:10)*1000, data_solar$solar_cap_factor, data_solar$load, beta$so2)

# The standard errors of AEA(K) are computed using the self-written function
# 'se_aea()'.
se_aea_solar_high = se_aea((0:10)*1000, data_solar$solar_cap_factor,
                           data_solar$load, data)

# Add two new columns to 'aea_solar_co2' containing the lower and upper bound
# for the error area.
library(dplyr)
co2_high = mutate(aea_solar_co2_high, se_co2_min = AEA-2*se_aea_solar_high[,2],
                  se_co2_max = AEA+2*se_aea_solar_high[,2])

# Create the plot for CO2. 'geom_ribbon()' displays an interval around
# the estimates of AEA(K) representing two times their standard error. 
library(ggplot2)
plot_co2_high = ggplot(co2_high,aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_co2_min, ymax=se_co2_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for CO2 [tons/MWh]")

# Create the plot for NOx.
nox_high = mutate(aea_solar_nox_high, se_nox_min = AEA-2*se_aea_solar_high[,3],
                  se_nox_max = AEA+2*se_aea_solar_high[,3])

plot_nox_high = ggplot(nox_high, aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_nox_min, ymax=se_nox_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for NOx [lbs/MWh]")

# Create the plot for SO2.
so2_high = mutate(aea_solar_so2_high, se_so2_min = AEA-2*se_aea_solar_high[,4],
                  se_so2_max = AEA+2*se_aea_solar_high[,4])

plot_so2_high = ggplot(so2_high, aes(x=K, y=AEA)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_so2_min, ymax=se_so2_max),alpha=0.3) +
  xlab("K [MW]") +
  ylab("AEA(K) for SO2 [lbs/MWh]")

# These lines arrange the three plots next to each other.
grid.arrange(plot_co2_high, plot_nox_high, plot_so2_high, ncol = 3,
             top="Average Emissions Avoided by K'th MW of Solar Capacity for Higher Capacity Levels")
#>

The results predict that, once more than about $4,000\,\text{MW}$ of solar PV panels have been installed in Texas, each additional unit will on average offset successively more emissions per $\text{MWh}$ again. A reason for this could be that larger quantities of solar power are more likely to offset more of the base-load coal generation, which accounts for the largest quantities of emitted pollutants. However, this development is predicted only for capacity levels well over the $556\,\text{MW}$ of ERCOT solar capacity installed at the end of 2016 (ERCOT (2017a)).

b) Social Benefits of Solar Power

Let's now turn to the social benefits from solar power. The next code chunk defines again the social costs for one ton of CO$_2$, NO$_x$ and SO$_2$ as described in exercise 5. Just click on check.

#< task
# These lines store the social costs for CO2, NOx and SO2.

sc_co2 = 32
sc_nox = 528.409
sc_so2 = 3194.46
#>

These values are now used to compute the external benefits for both scenarios: the lower-impact one, in which only CO$_2$ is offset by solar power, and the higher-impact one, in which all three pollutants are offset. Just click check on the following two code chunks.

#< task
# Compute the social benefits from solar power for both scenarios.

aeb_solar_low = sc_co2*aea_solar_co2[,2]
aeb_solar_high = sc_co2*aea_solar_co2[,2] + sc_nox*aea_solar_nox[,2]/2000 +
  sc_so2*aea_solar_so2[,2]/2000
#>
#< task_notest
# This code chunk plots the estimates for the social benefits for both scenarios.

# These commands create a data frame with the capacity levels, the results for
# both scenarios and the lower and upper bound for their error areas.
aeb = data.frame(K = aea_solar_co2$K,
                 aeb_low = aeb_solar_low,
                 aeb_high = aeb_solar_high)

aeb_tot = mutate(aeb,
                 se_low_min = aeb_low - 2*se_aea_solar[,2]*sc_co2,
                 se_low_max = aeb_low + 2*se_aea_solar[,2]*sc_co2,
                 se_high_min = aeb_high -
                   2*(se_aea_solar[,2]*sc_co2+se_aea_solar[,3]*sc_nox/2000+
                        se_aea_solar[,4]*sc_so2/2000),
                 se_high_max = aeb_high +
                   2*(se_aea_solar[,2]*sc_co2+se_aea_solar[,3]*sc_nox/2000+
                        se_aea_solar[,4]*sc_so2/2000))

# These commands create the two plots.
plot_low = ggplot(aeb_tot, aes(x=K, y=aeb_low)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_low_min, ymax=se_low_max), alpha=0.3) +
  xlab("K [MW]") +
  ylab("Social Benefit of Solar [$/MWh] - Only CO2 Offset") +
  scale_y_continuous(limits = c(16,30))

plot_high = ggplot(aeb_tot, aes(x=K, y=aeb_high)) +
  geom_line() +
  geom_ribbon(aes(ymin=se_high_min, ymax=se_high_max), alpha=0.3) +
  xlab("K [MW]") +
  ylab("Social Benefit of Solar [$/MWh] - CO2, NOx, SO2 Offset") +
  scale_y_continuous(limits = c(16,30))

# This command arranges the two plots next to each other for easier comparison.
grid.arrange(plot_low, plot_high, ncol = 2, 
             top="Average Social Benefit from K'th MW of Solar Capacity")
#>

< award "Master of Solar Power"

Great, you computed the average emissions avoided and the social benefit of solar power! We can now compare the results to those for wind power and use this as the basis for judging renewable energy policies.

>

Of course, the properties of $\widehat{AEA}(K)$ are mirrored again in the social benefits of avoiding pollution due to solar energy. The first unit of solar capacity has an estimated average social benefit of slightly more than \$$21/\text{MWh}$ for the lower-impact scenario and over \$$23/\text{MWh}$ for the higher-impact scenario, which Novan considers more likely. At $3,000\,\text{MW}$, this is predicted to drop to about \$$19/\text{MWh}$ if only CO$_2$ is avoided and to about \$$21/\text{MWh}$ if all three pollutants are offset in response to solar capacity. That is a decrease of about $10\%$ for the lower-impact scenario and $9\%$ for the higher-impact scenario.
Our results for solar power differ from those for wind power. While the two renewable technologies are estimated to provide nearly the same social benefits at very low levels of installed capacity, their estimated impacts initially diverge as the capacities grow. The social benefits might converge again for capacities close to $10,000\,\text{MW}$, but the estimates for solar power in this range are very imprecise with the available data.

The results in this exercise reveal that the social benefits of renewable energies depend on the renewable technology - we find heterogeneity also across technologies.
In the next exercise, we will use our findings from the last few exercises to discuss different renewable energy policies.

Exercise 7 -- Implications for Renewable Energy Policies

In the previous exercises, we explored the effect of additional wind and solar capacity on pollution. Based on these results, we will now discuss the implications for renewable energy policies. First, however, let's recapitulate the main points from our results.

We used hourly fluctuations in wind and solar generation to estimate their impact on emissions from fossil fuel power plants in Texas and Oklahoma. We were especially interested in exploring the social benefit of capacity additions. The most important finding is:

Renewable energy is a heterogeneous good!

We saw this in two ways:
1. The social benefits depend on the type of renewable technology. We compared wind and solar power and found that their impact on emissions is not the same. This can be explained by the fact that wind turbines tend to produce most electricity when the demand is low, while solar photovoltaic panels reach high levels of generation when the demand for electricity is high as well.
2. Even when considering only one renewable technology, not every additional unit of capacity provides the same social benefit. It varies with the level of already installed capacity because this influences which types of power plants have to reduce their production in response to the renewable generation.

It is important to keep in mind that our results refer to short-run effects of renewable energy, as Novan points out. In the long run, the composition of the installed power plants might change due to an increasing share of renewable capacity in a market. Some conventional power plants might shut down altogether or investors might decide against building new ones. This would likely lead to altered effects of renewable energy on emissions.

While we concentrated on one market in our analysis, other studies have also found heterogeneity across markets. Siler-Evans et al. (2012) and Graff Zivin et al. (2014) show that marginal emissions from power plants vary considerably for all three pollutants across different US electrical grids. And Banzhaf & Chupp (2012) find that the social costs from emitting NO$_x$ and SO$_2$ depend substantially on the US state in which they are emitted. This suggests that the social benefits from renewable energies are not the same across markets.

What do our results mean for policies supporting renewable energies?
The main implication of our results is that a renewable energy policy has to incorporate the heterogeneities in renewable energies in order to induce the renewable investments that bring the greatest social benefits. Novan argues that this can be achieved by emission taxes or cap and trade programmes on pollutants. These ensure that renewable units producing at times with high marginal emissions, and therefore higher electricity prices, profit the most. This makes investments in these technologies more profitable than investments in technologies that produce heavily when marginal emissions are low.
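
The following optional chunk illustrates this mechanism with purely hypothetical numbers: under an emission tax, the tax is passed through to the electricity price in proportion to the marginal emissions, so the price rises most in hours with high marginal emissions, and a renewable unit producing in exactly those hours gains the most.

#< task_notest
# Purely illustrative: how an emission tax rewards renewable generation in
# hours with high marginal emissions. All numbers are hypothetical.
library(dplyr)

hours = data.frame(hour = 1:4,
                   marginal_co2 = c(0.9, 0.5, 0.4, 0.8), # tons CO2 per marginal MWh
                   base_price = c(25, 45, 50, 30))       # $/MWh without a tax

tax = 21 # assumed tax in $/ton CO2

# The tax raises the price by tax*marginal_co2 in each hour.
hours = mutate(hours, price_with_tax = base_price + tax*marginal_co2)
hours

# The price increase is largest in hours 1 and 4, so a renewable technology
# producing mainly in these hours profits the most from the tax.
#>
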

How are renewable energies currently promoted or facilitated in Texas and the US in general?
There are several policies currently in effect both on the federal and the state level.

Business Energy Investment Tax Credit (ITC):
The ITC is a federal-level corporate tax credit programme (DSIRE (2017a)). It offers a $30\%$ tax rebate for investments in solar technologies, fuel cells and wind turbines, and $10\%$ for geothermal facilities, microturbines and combined heat and power facilities. The ITC programme is being phased out over the next few years.

Renewable Electricity Production Tax Credit (PTC):
The PTC is another corporate tax credit programme on the federal level (DSIRE (2017b)). It offers a tax reduction of \$$0.023/\text{kWh}$ for renewable energy from wind, geothermal, biomass and solar facilities and \$$0.012/\text{kWh}$ for other eligible technologies which started construction before 2017. The reduction applies during the first ten years of production. Only facilities not using the ITC are eligible. The programme is being phased out from 2017 onwards by successively reducing the tax rebates.
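
To get a feeling for the magnitude of the PTC, the following optional chunk computes the annual tax credit for a hypothetical wind farm. The capacity and the capacity factor are illustrative assumptions, not values from our dataset.

#< task_notest
# Illustrative only: annual PTC for a hypothetical wind farm.
capacity_mw = 100      # assumed nameplate capacity
capacity_factor = 0.35 # assumed average utilisation over the year
generation_mwh = capacity_mw * capacity_factor * 8760 # 8760 hours per year

ptc_per_mwh = 0.023 * 1000 # $0.023/kWh corresponds to $23/MWh

generation_mwh * ptc_per_mwh # roughly 7 million dollars per year
#>

Note that \$$23/\text{MWh}$ is of the same order of magnitude as the social benefits we estimated, but the PTC pays this rate regardless of when the electricity is produced and of the capacity already installed.
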

Renewable Portfolio Standards (RPS):
There are RPS in 29 states, including Texas, and in the District of Columbia (NCSL (2017)). These set goals for the amount or share of renewable energies in the state. The Texan RPS requires $10,000\,\text{MW}$ of renewable capacity by 2025, a goal which has already been met (DSIRE (2017c)). Each energy utility is assigned a requirement and earns Renewable Energy Credits (RECs) for producing energy from renewable technologies. These can be traded with utilities that do not meet their obligations. Currently, utilities in Texas earn one REC for each $\text{MWh}$ of wind energy and two RECs for each $\text{MWh}$ from other renewable technologies, including solar.
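
As a small, purely hypothetical example of this weighting:

#< task_notest
# Hypothetical example of the Texan REC weighting described above.
wind_mwh = 500  # assumed annual wind generation of a utility
solar_mwh = 500 # assumed annual solar generation

wind_mwh*1 + solar_mwh*2 # RECs earned: one per MWh of wind, two per MWh of solar
#>
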

In exercise 5, we already encountered other programmes targeting the emission of pollutants directly: the Clean Air Interstate Rule (CAIR) and the Acid Rain Program (ARP). CAIR was replaced in 2015 by the Cross-State Air Pollution Rule (CSAPR), a programme which, similar to CAIR, targets annual NO$_x$ and SO$_2$ emissions as well as ozone-season NO$_x$ emissions (EPA (2015)). These cap and trade programmes are exactly the type of programme which Novan considers most successful in incorporating the heterogeneities in renewable energies. However, we already discussed that during the period we studied, both CAIR and ARP might not have been very effective because their caps were most likely not binding and the prices for their allowances were only a fraction of the estimated social costs of NO$_x$ and SO$_2$. More recently, the prices for ARP and CSAPR allowances were still low as well (EPA (2015)).

Although there are several different policies promoting renewable energies on the federal as well as the state level in the US, they mostly neglect the heterogeneities that Novan and other researchers found in renewable energies. They do not take into account the level of already installed renewable capacity, and the federal programmes apply the same scheme to all markets. The differences between programmes on the state level are not based on the spatial heterogeneity of renewables. Some subsidies differentiate between renewable technologies, but only in a very general and somewhat arbitrary way. In the best case, they would give stronger support to the technologies which offer higher social benefits; they could then have effects similar to pollution prices or cap and trade programmes (cf. section V.E in Novan's paper).

While the current renewable energy policies in the US have led to an increase in investments in renewable energies (as shown by Hitaj (2013) for wind power), they do not induce socially optimal investments. Novan points out that future research has to show how costly it is when renewable policies neglect the heterogeneity in renewable energies.

Exercise 8 -- Summary

Congratulations! You completed the problem set "Pollution Reductions from Renewable Energies: Social Benefits and Policy Implications".
You set off on an econometric journey through the paper "Valuing the Wind: Renewable Energy Policies and Air Pollution Avoided" by Kevin Novan. On your way, you explored econometric topics, learned to navigate through R and got valuable insights into the question: What impact do renewable energies have on the emissions from fossil fuel power plants?

What did we do in this problem set?
Exercise 1.1 presented the background needed to understand the rest of the analysis: electrical grids, the merit order and pollution from power plants. Exercise 1.2 illustrated the fluctuations in wind energy, which we exploited for our estimations, and introduced the energy mix in Texas. In doing so, we got to know several useful R packages, especially dplyr and ggplot2, which remained very helpful throughout the rest of the problem set. Finally, in exercise 1.3, we explored the data used in the following exercises.
The second part addressed the theoretical foundation for linear regressions and developed Novan's first model. In exercise 2.1, we learned how to perform regressions using OLS in R by estimating a simple model, which only considered the effect of wind generation on pollution. Most importantly, we learned the basics of reading and interpreting regression results. Exercise 2.2 provided the theoretical background for CLR models and OLS. We extended the initial, simple model in exercise 2.3 by including control variables for the demand in Texas and Oklahoma and examined how this changed the estimated impact of wind power on pollution. Here we discussed the omitted variable bias, $R^2$ and the adjusted $R^2$. Exercise 2.4 dealt with seasonality in our data and showed how to deseasonalise the variables in our model using daily and hourly fixed effects. In exercise 2.5, we encountered the Newey-West variance-covariance matrix, which we used to treat autocorrelation in the residuals of our model in order to get robust standard errors. Finally, in exercise 2.6, we used the complete model to estimate the average impact of wind power on CO$_2$, NO$_x$ and SO$_2$ emissions from power plants, first only in Texas and then in Texas as well as Oklahoma.
In exercise 3.1, we turned away from the average impact of wind generation and focused instead on the emissions avoided by the marginal $\text{MWh}$ of wind power. We got to know Novan's second CLR model, which incorporated interaction terms between wind generation and demand, and saw how to use the results to compute the marginal emissions avoided (MEA). Here we also learned how to use for-loops and write functions in R. Following this, we analysed the MEA in exercise 3.2 and explained their behaviour by taking a look at the marginal generation avoided for different types of fossil fuel-powered generators and their emissions.
Building on this, exercise 4 developed a model for the average emissions avoided (AEA), which describe the average effect of the marginal unit of wind capacity. Here we also learned how to estimate a joint distribution function using kernel density estimation.
In exercise 5, we computed and analysed the AEA for different levels of installed wind capacity and, based on this, estimated the social benefits.
The next part of this problem set focused on a second renewable technology: solar power. Exercise 6.1 explored the characteristics of solar power and highlighted differences to wind power. In exercise 6.2, we computed the AEA and social benefits of solar energy and compared them to the results for wind power.
Finally, in exercise 7, we discussed how renewable energy policies can best incorporate the findings from Novan's paper and related studies. We learned that existing policies do not take heterogeneities in renewable energies into account.

Thank you for solving the problem set.
I hope you gained new knowledge about R, econometrics and energy economics and can apply and build upon what you have learned in the future.

If you are interested in seeing what you accomplished in this problem set, run the next code chunk. It will show you all the awards you obtained.

#< task
# Run this chunk to show all of your awards.

awards()
#>

Exercise Bibliography

Banzhaf, H. S. & B. A. Chupp (2012): Fiscal Federalism and Interjurisdictional Externalities: New Results and an Application to US Air Pollution. Journal of Public Economics, 96(5-6), pp. 449-64.

Baum, C. F. (2017): KDENS2: Stata module to estimate bivariate kernel density. https://ideas.repec.org/c/boc/bocode/s448502.html. Accessed June 19, 2017.

Brockwell, P. J. & R. A. Davis (2002): Introduction to Time Series and Forecasting. 2nd Edition, New York: Springer-Verlag.

Burns, P. (2011): The R Inferno. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf. Accessed September 6, 2017.

Davies, T. M. (2016): The Book of R: A First Course in Programming and Statistics. San Francisco: No Starch Press.

Database of State Incentives for Renewables & Efficiency (DSIRE) (2017a): Business Energy Investment Tax Credit (ITC). http://programs.dsireusa.org/system/program/detail/658. Accessed August 25, 2017.

----- (2017b): Renewable Electricity Production Tax Credit (PTC). http://programs.dsireusa.org/system/program/detail/734. Accessed August 25, 2017.

----- (2017c): Renewable Generation Requirement. http://programs.dsireusa.org/system/program/detail/182. Accessed August 25, 2017.

Dincer, I. & C. Zamfirescu (2014): Advanced Power Generation Systems. Amsterdam: Elsevier.

Electric Reliability Council of Texas (ERCOT) (2017a): Inside the Promise: 2016 State of the Grid Report. http://www.ercot.com/content/wcm/lists/114739/2016_StateoftheGridReport.pdf. Accessed May 10, 2017.

----- (2017b): Company Profile. http://www.ercot.com/about/profile. Accessed July 27, 2017.

----- (2017c): Maps. http://www.ercot.com/news/mediakit/maps. Accessed July 27, 2017.

Energy Information Administration (EIA) (2012): Emissions Allowance Prices for SO2 and NOx Remained Low in 2011. February 2, 2012. https://www.eia.gov/todayinenergy/detail.php?id=4830#. Accessed July 10, 2017.

----- (2017a): Keyword: Gross Generation. https://www.eia.gov/tools/glossary/. Accessed July 3, 2017.

----- (2017b): Keyword: Net Generation. https://www.eia.gov/tools/glossary/. Accessed July 3, 2017.

----- (2017c): Keyword: Combined Cycle. https://www.eia.gov/tools/glossary/. Accessed September 6, 2017.

----- (2017d): Keyword: Heat Rate. https://www.eia.gov/tools/glossary/. Accessed September 6, 2017.

----- (2017e): Keyword: Thermal Efficiency. https://www.eia.gov/tools/glossary/. Accessed September 6, 2017.

----- (2017f): Texas: State Profile and Energy Estimates: Profile Overview. https://www.eia.gov/state/?sid=TX. Accessed July 13, 2017.

Environmental Protection Agency (EPA) (2011): Progress Report 2011: Clean Air Interstate Rule, Acid Rain Program, and Former NOx Budget Trading Program: SO2 and NOx Emissions, Compliance, and Market Analyses Report. https://www.epa.gov/sites/production/files/2015-08/documents/arpcair11_analyses_0.pdf. Accessed July 10, 2017.

----- (2014): 2014 Program Progress: Clean Air Interstate Rule, Acid Rain Program, and Former NOx Budget Trading Program. https://www3.epa.gov/airmarkets/progress/reports/pdfs/2014_full_report.pdf. Accessed July 10, 2017.

----- (2015): 2015 Program Progress: Cross-State Air Pollution Rule and Acid Rain Program. https://www3.epa.gov/airmarkets/progress/reports/pdfs/2015_full_report.pdf. Accessed September 10, 2017.

Frazee, A. (2014): Let's talk about vectorization. http://alyssafrazee.com/2014/01/29/vectorization.html. Accessed August 25, 2017.

Graff Zivin, J. S., M. J. Kotchen & E. T. Mansur (2014): Spatial and Temporal Heterogeneity of Marginal Emissions: Implications for Electric Cars and Other Electricity-Shifting Policies. Journal of Economic Behavior and Organization, 107(Pt. A), pp. 248-268.

Greene, W. H. (2012): Econometric Analysis. 7th Edition, Harlow: Pearson Education.

Hau, E. (2014): Windkraftanlagen: Grundlagen, Technik, Einsatz, Wirtschaftlichkeit. 5th Edition, Berlin: Springer Vieweg.

Hitaj, C. (2013): Wind power development in the United States. Journal of Environmental Economics and Management, 65(3), pp. 394-410.

Hollander, M., D. A. Wolfe & E. Chicken (2014): Nonparametric Statistical Methods. 3rd Edition, Hoboken, NJ: John Wiley & Sons.

Interagency Working Group (IAWG) on Social Cost of Carbon (2013): Technical Update of the Social Cost of Carbon for Regulatory Impact Analysis Under Executive Order 12866. https://www.itreetools.org/eco/resources/social_cost_of_carbon_for_ria_2013_update.pdf. Accessed June 13, 2017.

Jaeger, W. K. (2005): Environmental Economics for Tree Huggers and Other Skeptics. Washington, DC: Island Press.

Kennedy, P. (2008): A Guide to Econometrics. 6th Edition, Malden, MA: Blackwell Publishing.

Lucy, D. & R. Aykroyd (2015): Package 'GenKern': Functions for Generating and Manipulating Binned Kernel Density Estimates. https://cran.r-project.org/web/packages/GenKern/GenKern.pdf. Accessed June 20, 2017.

Muller, N. Z. & R. Mendelsohn (2009): Efficient Pollution Regulation: Getting the Prices Right. American Economic Review, 99(5), pp. 1714-1739.

National Conference of State Legislatures (NCSL) (2017): State Renewable Portfolio Standards and Goals. http://www.ncsl.org/research/energy/renewable-portfolio-standards.aspx. Accessed August 25, 2017.

Newey, W. K. & K. D. West (1987): A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55(3), pp. 703-708.

Nordhaus, W. D. (2016): Revisiting the Social Cost of Carbon. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 114(7), pp. 1518-1523.

Novan, K. (2015a): Valuing the Wind: Renewable Energy Policies and Air Pollution Avoided. American Economic Journal: Economic Policy, 7(3), pp. 291-326.

----- (2015b): Valuing the Wind: Renewable Energy Policies and Air Pollution Avoided: Dataset. American Economic Journal: Economic Policy. http://dx.doi.org/10.1257/pol.20130268.

Regulatory Assistance Project (RAP) (2010): Research Brief: Emissions Performance Standards In Selected States. https://www.raponline.org/wp-content/uploads/2016/05/rap-researchbrief-simpson-eps-updated-2010-08-12-2-.pdf. Accessed July 10, 2017.

Rhodes, J. D., M. E. Webber, T. Deetjen & T. Davidson (2017): Are solar and wind really killing coal, nuclear and grid reliability?. https://theconversation.com/are-solar-and-wind-really-killing-coal-nuclear-and-grid-reliability-76741. Accessed September 10, 2017.

Ross, N. (2014): Vectorization in R: Why?. http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html. Accessed August 25, 2017.

Schellong, W. (2016): Analyse und Optimierung von Energieverbundsystemen. Berlin: Springer Vieweg.

Scott, D. W. (2015): Multivariate Density Estimation: Theory, Practice, and Visualization. 2nd Edition, Hoboken, NJ: John Wiley & Sons.

Siler-Evans, K., I. L. Azevedo & M. G. Morgan (2012): Marginal Emissions Factors for the U.S. Electricity System. Environmental Science and Technology, 46(9), pp. 4742-4748.

Stock, J. H. & M. W. Watson (2007): Introduction to Econometrics. 2nd Edition, Boston: Pearson Education.

Tol, R. S. J. (2008): The Social Cost of Carbon: Trends, Outliers and Catastrophes. Economics: The Open-Access, Open-Assessment E-Journal, 2(2008-25). http://dx.doi.org/10.5018/economics-ejournal.ja.2008-25

Venables, W. N., D. M. Smith & the R Core Team (2017): An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics, Version 3.4.0 (2017-04-21). https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf. Accessed May 8, 2017.

Wan, Y. (2005): A Primer on Wind Power for Utility Applications. National Renewable Energy Laboratory Technical Report NREL/TP-500-36230. http://www.nrel.gov/docs/fy06osti/36230.pdf. Accessed May 10, 2017.

WINDExchange (2017): Installed Wind Capacity. https://apps2.eere.energy.gov/wind/windexchange/wind_installed_capacity.asp. Accessed July 27, 2017.

Wooldridge, J.M. (2013): Introductory Econometrics: A Modern Approach. 5th Edition, Mason, OH: South Western, Cengage Learning.

Zhu, J. (2015): Optimization of Power System Operation. 2nd Edition, Hoboken, NJ: John Wiley & Sons.

R and R packages

R: The R Foundation (2017): The R Project for Statistical Computing. R version 3.3.1. https://www.r-project.org/.

dplyr: Wickham, H. & R. Francois (2016): Package 'dplyr': A Grammar of Data Manipulation. R package version 0.5.0. https://CRAN.R-project.org/package=dplyr.

GenKern: Lucy, D. & R. Aykroyd (2013): Package 'GenKern': Functions for Generating and Manipulating Binned Kernel Density Estimates. R package version 1.2-60. https://CRAN.R-project.org/package=GenKern.

ggplot2: Wickham, H. & W. Chang (2016): Package 'ggplot2': Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 2.2.1. https://CRAN.R-project.org/package=ggplot2.

gridExtra: Auguie, B. (2016): Package 'gridExtra': Miscellaneous Functions for "Grid" Graphics. R package version 2.2.1. https://CRAN.R-project.org/package=gridExtra.

lemon: Edwards, S. M. (2017): Package 'lemon': Freshing Up your 'ggplot2' Plots. R package version 0.3.0. https://CRAN.R-project.org/package=lemon.

lfe: Gaure, S. (2016): Package 'lfe': Linear Group Fixed Effects. R package version 2.5-1998. https://cran.r-project.org/web/packages/lfe/index.html.

lubridate: Grolemund, G., V. Spinu & H. Wickham (2016): Package 'lubridate': Make Dealing with Dates a Little Easier. R package version 1.6.0. https://CRAN.R-project.org/package=lubridate.

microbenchmark: Mersmann, O. (2015): Package 'microbenchmark': Accurate Timing Functions. R package version 1.4-2.1. https://CRAN.R-project.org/package=microbenchmark.

reshape2: Wickham, H. (2016): Package 'reshape2': Flexibly Reshape Data: A Reboot of the Reshape Package. R package version 1.4.1. https://CRAN.R-project.org/package=reshape2.

RTutor: Kranz, S. (2015): Package 'RTutor': Creating R exercises with automatic assessment of student's solutions. R package version 2015.12.16. https://github.com/skranz/RTutor.

sandwich: Zeileis, A. (2015): Package 'sandwich': Robust Covariance Matrix Estimators. R package version 2.3-4. https://CRAN.R-project.org/package=sandwich.

stargazer: Hlavac, M. (2015): Package 'stargazer': Well-Formatted Regression and Summary Statistics Tables. R package version 5.2. https://CRAN.R-project.org/package=stargazer.


